Abstract: In this paper, we consider the problem of partitioning a small
data sample drawn from a mixture of $k$ product distributions. We are
interested in the case that individual features are of low average quality
$\gamma$, and we want to use as few of them as possible to correctly
partition the sample. We analyze a spectral technique that is able to
approximately optimize the total data size (the product of the number of data
points $n$ and the number of features $K$) needed to correctly perform this
partitioning as a function of $1/\gamma$ for $K > n$. Our goal is motivated by
an application in clustering individuals according to their population of
origin using markers, when the divergence between any two of the populations
is small.

1. Introduction

We explore a type of classification problem that arises in the context of compu- 
tational biology. The problem is that we are given a small sample of size n, e.g., 
DNA of n individuals (think of n in the hundreds or thousands), each described 
by the values of K features or markers, e.g., SNPs (Single Nucleotide Polymor- 
phisms, think of K as an order of magnitude larger than n). Our goal is to use 
these features to classify the individuals according to their population of origin. 
Features have slightly different probabilities depending on which population the 
individual belongs to, and are assumed to be independent of each other (i.e., our 
data is a small sample from a mixture of k very similar product distributions). 
The objective we consider is to minimize the total data size $D = nK$ needed
to correctly classify the individuals in the sample as a function of the
"average quality" $\gamma$ of the features, under the assumption that $K > n$.
Throughout the paper, we use $p_i^j$ and $\mu_i^j$ as shorthands for
$p_i^{(j)}$ and $\mu_i^{(j)}$ respectively.
Statistical Model: We have $k$ probability spaces $\Omega_1, \ldots, \Omega_k$
over the set $\{0, 1\}^K$. Further, the components (features) of
$z \in \Omega_t$ are independent and $\Pr_{\Omega_t}[z_i = 1] = p_t^i$
($1 \le t \le k$, $1 \le i \le K$). Hence, the probability spaces
$\Omega_1, \ldots, \Omega_k$ comprise the distribution of the features for each
of the $k$ populations. Moreover, the input of the algorithm consists of a
collection (mixture) of $n = \sum_{t=1}^{k} N_t$ unlabeled samples, $N_t$ points
from $\Omega_t$, and the algorithm is to determine for each data point from
which of $\Omega_1, \ldots, \Omega_k$ it was chosen. In general we do not assume
that $N_1, \ldots, N_k$ are revealed to the algorithm; but we do require some
bounds on their relative sizes. An important parameter of the probability
ensemble $\Omega_1, \ldots, \Omega_k$ is the measure of divergence

$$\gamma = \min_{1 \le s < t \le k} \frac{\sum_{i=1}^{K} (p_s^i - p_t^i)^2}{K} \qquad (1.1)$$

between any two distributions. Note that $\sqrt{K\gamma}$ measures the Euclidean
distance between the means of any two distributions and thus represents their
separation. Further, let $N = n/k$ (so if the populations were balanced we would
have $N$ of each type) and assume from now on that $kN < K$. Let $D = nK$ denote
the size of the data-set. In addition, let
$\sigma^2 = \max_{i,t} p_t^i (1 - p_t^i)$ denote the maximum variance of any
random bit.

The biological context for this problem is that we are given DNA information
from $n$ individuals from $k$ populations of origin and we wish to classify each
individual into the correct category. DNA contains a series of markers called
SNPs, each of which has two variants (alleles). Given the population of origin
of an individual, the genotypes can be reasonably assumed to be generated by
drawing alleles independently from the appropriate distribution. The following
theorem gives a sufficient condition for a balanced ($N_1 = N_2$) input instance
when $k = 2$.

Theorem 1.1 (Zhou (2006)). Assume $N_1 = N_2 = N$. If
$K = \Omega\left(\frac{\ln N}{\gamma}\right)$ and
$KN = \Omega\left(\frac{\ln N \log\log N}{\gamma^2}\right)$, then with
probability $1 - 1/\mathrm{poly}(N)$, among all balanced cuts in the complete
graph formed among $2N$ sample individuals, the maximum weight cut corresponds
to the partition of the $2N$ individuals according to their population of
origin. Here the weight of a cut is the sum of weights across all edges in the
cut, and the edge weight equals the Hamming distance between the bit vectors of
the two endpoints.

Variants of the above theorem, based on a model that allows two random
draws from each SNP for an individual, are given in Chaudhuri et al. (2007);
Zhou (2006). In particular, notice that edge weights based on the inner product
of two individuals' bit vectors correspond to the sample covariance, in which
case the max-cut corresponds to the correct partition Zhou (2006) with high
probability. Finding a max-cut is computationally intractable; hence in the same
paper Chaudhuri et al. (2007), a hill-climbing algorithm is given to find the
correct partition for balanced input instances but with a stronger requirement
on the sizes of both $K$ and $nK$.

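The statistical model and the divergence $\gamma$ of (1.1) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the helper names `divergence` and `sample_mixture`, the way the probabilities `P` are generated, and the sample sizes are all hypothetical choices made for illustration and do not come from the paper.

```python
import numpy as np

def divergence(P):
    """gamma as in (1.1): minimum over population pairs s < t of the
    average squared difference of feature probabilities. P has shape (k, K)."""
    k, K = P.shape
    return min(
        np.sum((P[s] - P[t]) ** 2) / K
        for s in range(k)
        for t in range(s + 1, k)
    )

def sample_mixture(P, sizes, rng):
    """Draw sizes[t] independent-bit samples from population t.
    Returns the n x K 0/1 matrix and the (hidden) population labels."""
    labels = np.repeat(np.arange(len(sizes)), sizes)
    bits = (rng.random((labels.size, P.shape[1])) < P[labels]).astype(np.int8)
    return bits, labels

rng = np.random.default_rng(0)
K = 2000
# Hypothetical feature probabilities for k = 2 populations (illustration only).
P = np.clip(0.5 + 0.1 * rng.standard_normal((2, K)), 0.05, 0.95)
bits, labels = sample_mixture(P, sizes=[30, 30], rng=rng)

gamma = divergence(P)              # average feature quality
sigma2 = np.max(P * (1 - P))       # maximum variance of any random bit
separation = np.sqrt(K * gamma)    # Euclidean distance between the two means
```

For $k = 2$ the quantity `separation` coincides with $\|P_1 - P_2\|_2$, which is the sense in which $\sqrt{K\gamma}$ represents the separation of the populations.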
A Spectral Approach: In this paper, we construct two simpler algorithms
using spectral techniques, attempting to reproduce the conditions above. In
particular, we study the requirements on the parameters of the model (namely,
$\gamma$, $N$, $k$, and $K$) that allow us to classify every individual
correctly and efficiently with high probability.

The two algorithms Classify and Partition compare as follows. Both
algorithms are based on spectral methods originally developed in graph
partitioning. More precisely, Theorem 1.2 is based on computing the singular
vectors with the two largest singular values of the $n \times K$ input random
matrix. The procedure is conceptually simple, easy to implement, and efficient
in practice. For simplicity, Procedure Classify assumes the separation
parameter $\gamma$ is known in order to decide which singular vector to
examine; in practice, one can just try both singular vectors, as we do in the
simulations. Proof techniques for Theorem 1.2, however, are difficult to apply
to cases of multiple populations, i.e., $k > 2$. Procedure Partition is based
on computing a rank-$k$ approximation of the input random matrix and can cope
with a mixture of a constant number of populations. It is more intricate in
both implementation and execution than Classify. It does not require $\gamma$
as an input; it only requires that the constant $k$ is given. We prove the
following theorems.

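Theorem 1.2. Let $\omega = \min(N_1, N_2)/n$ and let $\omega_{\min}$ be a lower
bound on $\omega$. Let $\gamma$ be given. Assume that $K > 2n \ln n$ and
$k = 2$. Procedure Classify allows us to separate two populations w.h.p., when
$n \ge \Omega\left(\frac{\sigma^2}{\gamma\,\omega_{\min}\,\omega}\right)$, where
$\sigma^2$ is the largest variance of any random bit, i.e.
$\sigma^2 = \max_{i,t} p_t^i (1 - p_t^i)$. Thus if the populations are roughly
balanced, then $n \ge c/\gamma$ suffices for some constant $c$.

This implies that the data required is
$D = nK = O\left(\ln n\, \sigma^4 / (\gamma^2 \omega^2 \omega_{\min}^2)\right)$.
Let $P_s = (p_s^i)_{i=1,\ldots,K}$; we have

$$\|P_1 - P_2\|_2 = \sqrt{\sum_{i=1}^{K} (p_1^i - p_2^i)^2} \ge \sqrt{K\gamma}. \qquad (1.2)$$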

Theorem 1.3. Let $\omega = \min(N_1, \ldots, N_k)/n$. There is a polynomial
time algorithm Partition that satisfies the following. Suppose that
$K > n \log n$, $\gamma >$ [...], $n >$ [...] for some large enough constant
$C_k$, and $\omega = \Omega(1)$. Then, given the empirical $n \times K$ matrix
comprising the $K$ features for each of the $n$ individuals along with the
parameter $k$, Partition separates the $k$ populations correctly w.h.p.

Summary and Future Direction: Note that unlike Theorem 1.1, both Theorem 1.2
and Theorem 1.3 require a lower bound on $n$, even when $k = 2$ and the input
instance is balanced. Our simulations suggest that this is not a fundamental
constraint of the spectral techniques: the experimental results show that even
when $n$ is small, by increasing $K$ so that $nK = \Omega(1/\gamma^2)$, one can
classify a mixture of two populations using ideas in Procedure Classify with
success rate reaching an "oracle" curve, which is computed assuming that the
distributions are known, where success rate means the ratio between correctly
classified individuals and $N$. Exploring the tradeoffs between $n$ and $K$
that are sufficient for classification, when the sample size $n$ is small, is
of both theoretical interest and practical value.

Outline of the paper: The paper is organized as follows. In Section 1.1 we 
discuss related work. Then, in Section 2 we describe the algorithm Classify for 
Theorem 1.2 and outline its analysis. Some (very) technical details of the analy- 
sis are deferred to the appendix. Section 3 deals with the algorithm Partition 
for Theorem 1.3. Finally, in Section 4 we report some experimental results on 
Classify. 

1.1. Related work 

In their seminal paper Pritchard et al. (2000), Pritchard, Stephens, and Don- 
nelly presented a model-based clustering method to separate populations using 
genotype data. They assume that observations from each cluster are random 
from some parametric model. Inference for the parameters corresponding to 
each population is done jointly with inference for the cluster membership of 
each individual, and k in the mixture, using Bayesian methods. 

The idea of exploiting the eigenvectors with the first two eigenvalues of the 
adjacency matrix to partition graphs goes back to the work of Fiedler Fiedler 
(1973), and has been used in the heuristics for various NP-hard graph parti- 
tioning problems (e.g., Fjallstrom (1998)). The main difference between graph 
partitioning problems and the classification problem that we study is that the 
matrices occurring in graph partitioning are symmetric and hence diagonaliz- 
able, while our input matrix is rectangular in general. Thus, the contribution of 
Theorem 1.2 is to show that a conceptually simple and efficient algorithm based 
on singular value decompositions performs well in the framework of a fairly gen- 
eral probabilistic model, where probabilities for each of the K features for each 
of the $k$ populations are allowed to vary. Indeed, the analysis of Classify
requires exploring new ideas such as the Separation Lemma and the normalization
of the random matrix $X$, for generating a large gap between the top two
singular values of the expectation matrix $\mathcal{X}$ and for bounding the
angle between random singular vectors and their static counterparts, details of
which are included in Section 2 with analysis in the full version.

Procedure Partition and its analysis build upon the spectral techniques of 
McSherry (2001) on graph partitioning, and an extension due to Coja-Oghlan 
(2006). McSherry provides a comprehensive probabilistic model and presents a 
spectral algorithm for solving the partitioning problem on random graphs, pro- 
vided that a separation condition similar to (1.2) is satisfied. Indeed, McSherry 
(2001) encompasses a considerable portion of the prior work on Graph Coloring, 
Minimum Bisection, and finding Maximum Clique. Moreover, McSherry's ap- 
proach easily yields an algorithm that solves the classification problem studied 
in the present paper under similar assumptions as in Theorem 1.3, provided
that the algorithm is given the parameter $\gamma$ as an additional input; this
is actually pointed out in the conclusions of McSherry (2001). In the context of
graph partitioning, an algorithm that does not need the separation parameter 
as an input was devised in Coja-Oghlan (2006). The main difference between 
Partition and the algorithm presented in Coja-Oghlan (2006) is that Parti- 
tion deals with the asymmetric n x K matrix of individuals/features, whereas 
Coja-Oghlan (2006) deals with graph partitioning (i.e., a symmetric matrix). 

There are two streams of related work in the learning community. The first 
stream is the recent progress in learning from the point of view of cluster- 
ing: given samples drawn from a mixture of well-separated Gaussians (compo- 
nent distributions), one aims to classify each sample according to which com- 
ponent distribution it comes from, as studied in Dasgupta (1999), Dasgupta 
and Schulman (2000), Arora and Kannan (2001); Vempala and Wang (2002); 
Achlioptas and McSherry (2005); Kannan et al. (2005); Dasgupta et al. (2005). 
This framework has been extended to more general distributions such as log- 
concave distributions in Achlioptas and McSherry (2005); Kannan et al. (2005) 
and heavy-tailed distributions in Dasgupta et al. (2005), as well as to more than 
two populations. These results focus mainly on reducing the requirement on the
separations between any two centers $P_1$ and $P_2$. In contrast, we focus on
the sample size $D$. This is motivated by previous results Chaudhuri et al. (2007);
Zhou (2006) stating that by acquiring enough attributes along the same set 
of dimensions from each component distribution, with high probability, we can 
correctly classify every individual. 

While our aim is different from those results, where $n > K$ is almost universal
and we focus on cases $K > n$, we do have one common axis for comparison,
the $\ell_2$-distance between any two centers of the distributions. In earlier
works Dasgupta and Schulman (2000); Arora and Kannan (2001), the separation
requirement depended on the number of dimensions of each distribution; this has
recently been reduced to be independent of $K$, the dimensionality of the
distribution, for certain classes of distributions Achlioptas and McSherry
(2005); Kannan et al. (2005). This is comparable to our requirement in (1.2)
for the discrete distributions. For example, according to Theorem 7 in
Achlioptas and McSherry (2005), in order to separate the mixture of two
Gaussians,

[...] (1.3)

is required. Besides Gaussian and log-concave distributions, a general theorem
(Theorem 6 in Achlioptas and McSherry (2005)) is derived that in principle also
applies to mixtures of discrete distributions. The key difficulty of applying
their theorem directly to our scenario is that it relies on a concentration
property of the distribution (Eq. (10) of Achlioptas and McSherry (2005)) that
need not hold in our case. In addition, once the distance between any two
centers is fixed (i.e., once $\gamma$ is fixed in the discrete distribution),
the sample size $n$ in their algorithms is always larger than
[...$\log K$...] Achlioptas and McSherry (2005); Kannan et al. (2005) for
log-concave distributions (in fact, in Theorem 3 of Kannan et al. (2005), they
discard at least this many individuals in order to correctly classify the rest
in the sample), and larger than [...] for Gaussians Achlioptas and McSherry
(2005), whereas in our case, $n < K$ always holds. Hence, our analysis allows
one to obtain a clean bound on $n$ in the discrete case.

The second stream of work is under the PAC-learning framework, where
given a sample generated from some target distribution $Z$, the goal is to
output a distribution $Z'$ that is close to $Z$ in Kullback-Leibler divergence
$KL(Z \| Z')$, where $Z$ is a mixture of product distributions over discrete
domains or Gaussians Kearns et al. (1994); Freund and Mansour (1999); Cryan
(1999); Cryan et al. (2002); Mossel and Roch (2005); Feldman et al. (2005,
2006). They do not require a minimal distance between any two distributions,
but they do not aim to classify every sample point correctly either, and in
general require much more data.

Our work is also related to the use of principal component analysis ("PCA")
in genetics Patterson et al. (2006); Price et al. (2006). The basic approach in
these papers is to use the eigenvectors of a covariance matrix between samples
to analyze a mixture of populations. While Patterson et al. (2006); Price et al.
(2006) study the use of spectral methods empirically, the crucial point of the
present work is that we prove rigorously that spectral methods succeed on a 
certain (simple) probabilistic model. Hence, our work can be seen as a further 
theoretical justification of the practical use of PCA. A difference between the 
present paper and Patterson et al. (2006); Price et al. (2006) is that we actually 
aim to assign each individual to exactly one of the populations. By contrast, 
Patterson et al. (2006); Price et al. (2006) just assign each individual a real 
"weight" for each population: essentially the eigenvectors with the dominant 
eigenvalues corresponding to the populations, and each individual is assigned 
its projection on these dominant eigenvectors. The algorithm Classify is some- 
what similar to PCA, but Partition is conceptually more involved. In addition, 
our experimental results show a phase transition phenomenon similar to what 
was observed in Patterson et al. (2006) in detecting population structure using 
simulated data. 

2. A simple algorithm using singular vectors 

As described in Theorem 1.2, we assume we have a mixture of two product
distributions. Let $N_1, N_2$ be the number of individuals from each population
class. Our goal is to correctly classify all individuals according to their
distributions. Let $n = 2N = N_1 + N_2$, and refer to the case when $N_1 = N_2$
as the balanced input case. For convenience, let us redefine "$K$" to assume we
have $O(\log n)$ blocks of $K$ features each (so the total number of features
is really $O(K \log n)$) and we assume that each set of $K$ features has
divergence at least $\gamma$. (If we perform this partitioning of features into
blocks randomly, then with high probability this divergence has changed by only
a constant factor for most blocks.)

A. Blum et al. / Separating populations with wide data: A spectral analysis

The high-level idea of the algorithm is now to repeat the following procedure
for each block of $K$ features: use the $K$ features to create an $n \times K$
matrix $X$, such that each row $X_i$, $i = 1, \ldots, n$, corresponds to a
feature vector for one sample point, across its $K$ dimensions. We then compute
the top two left singular vectors $u_1, u_2$ of $X$ and use these to classify
each sample. This classification induces some probability of error $f$ for each
individual at each round, so we repeat the procedure for each of the
$O(\log n)$ blocks and then take a majority vote over the different runs. Each
round requires $K > n$ features, so we need $O(n \log n)$ features in total.

In more detail, we repeat the following procedure $O(\log n)$ times. Let
$T$ = [...], where $\omega_{\min}$ is the lower bound on the minimum weight
$\min\{N_1/2N, N_2/2N\}$, which is independent of an actual instance. Let
$s_1(X), s_2(X)$ be the top two singular values of $X$.

Procedure Classify: Given $\gamma$, $N$, $\omega_{\min}$. Assume that
$N \ge$ [...],

• Normalization: use the $K$ features to form a random $n \times K$ matrix $X$;
each individual random variable $X_{i,j}$ is a normalized random variable based
on the original Bernoulli r.v. $b_{i,j} \in \{0, 1\}$ with
$\Pr[b_{i,j} = 1] = p_1^j$ for $X_i \in P_1$ and $\Pr[b_{i,j} = 1] = p_2^j$ for
$X_i \in P_2$, such that $X_{i,j}$ = [...].

• Take the top two left singular vectors $u_1, u_2$ of $X$, where
$u_i = [u_{i,1}, \ldots, u_{i,n}]$, $i = 1, 2$.

1. If $s_2(X) > T$, use $u_2$ to partition the individuals with $0$ as the
threshold, i.e., partition $j \in [n]$ according to $u_{2,j} < 0$ or
$u_{2,j} \ge 0$.

2. Otherwise, use $u_1$ to partition, with the mixture mean
$M = \sum_{i=1}^{n} u_{1,i}/n$ as the threshold.
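One round of a Classify-style split can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `classify_round`, the normalization $X_{i,j} = (b_{i,j} + 1)/2$ (an assumed choice that places every entry in $\{1/2, 1\}$, consistent with the later requirement $a, b, c \in [K/4, K]$), and the synthetic test instance are all assumptions. The threshold `T` is passed in as a parameter rather than computed.

```python
import numpy as np

def classify_round(bits, T):
    """Split n individuals into two sides from an n x K 0/1 matrix.
    Returns a boolean array; which side is which population is arbitrary,
    because singular vectors are only defined up to sign."""
    X = (bits + 1.0) / 2.0                     # ASSUMED normalization into {1/2, 1}
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    u1, u2 = U[:, 0], U[:, 1]
    if s[1] > T:
        return u2 >= 0                         # step 1: threshold u_2 at 0
    return u1 >= u1.mean()                     # step 2: threshold u_1 at mixture mean M

# Hypothetical, well separated balanced instance (illustration only).
rng = np.random.default_rng(1)
K, N = 1000, 50
p1 = np.r_[np.full(K // 2, 0.2), np.full(K // 2, 0.8)]
p2 = p1[::-1]
truth = np.repeat([True, False], N)
bits = (rng.random((2 * N, K)) < np.where(truth[:, None], p1, p2)).astype(float)

pred = classify_round(bits, T=0.0)             # T = 0 forces the u_2 branch
acc = max(np.mean(pred == truth), np.mean(pred != truth))  # accuracy up to label swap
```

Passing `T=0.0` mirrors the remark that in practice one can simply try both singular vectors; a caller could run the function once with a very large `T` and once with `T=0.0` and keep the more plausible split.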

Analysis of the Simple Algorithm: Our analysis is based on comparing
entries in the top two singular vectors of the normalized random $n \times K$
matrix $X$ with those of a static matrix $\mathcal{X}$, where each entry
$\mathcal{X}_{i,j} = \mathbf{E}[X_{i,j}]$ is the expected value of the
corresponding entry in $X$. Hence $\forall i = 1, \ldots, N_1$,
$\mathcal{X}_i = [\mu_1^1, \mu_1^2, \ldots, \mu_1^K]$, where $\mu_1^j$ = [...],
$\forall j$, and $\forall i = N_1 + 1, \ldots, n$,
$\mathcal{X}_i = [\mu_2^1, \mu_2^2, \ldots, \mu_2^K]$, where $\mu_2^j$ = [...],
$\forall j$. We assume the divergence is exactly $\gamma$ among the $K$
features that we have chosen in all calculations.

The inspiration for this approach is based on Lemma 2.1, whose proof is built
upon Theorem A.2 that is presented in a lecture note by Spielman (2002). For
an $n \times K$ matrix $A$, let $s_1(A) \ge s_2(A) \ge \cdots \ge s_n(A)$ be the
singular values of $A$. Let $u_1, \ldots, u_n$, $v_1, \ldots, v_n$ be the $n$
left and right singular vectors of $X$, corresponding to
$s_1(X), \ldots, s_n(X)$, such that $\|u_i\|_2 = 1$, $\|v_i\|_2 = 1$,
$\forall i$. We denote the set of $n$ left and right singular vectors of
$\mathcal{X}$ with $\bar u_1, \ldots, \bar u_n$, $\bar v_1, \ldots, \bar v_n$.

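Lemma 2.1. Let $X$ be the random $n \times K$ matrix and $\mathcal{X}$ its
expected value matrix. Let $\Delta = X - \mathcal{X}$ be the zero-mean random
matrix. Let $\theta_i$ be the angle between the two vectors $[u_i, v_i]$ and
$[\bar u_i, \bar v_i]$, where
$\|[u_i, v_i]\|_2 = \|[\bar u_i, \bar v_i]\|_2 = \sqrt{2}$ and $[u, v]$
represents a vector that is the concatenation of two vectors $u, v$. Then

$$\|u_i - \bar u_i\|_2 \le \|[u_i, v_i] - [\bar u_i, \bar v_i]\|_2 \approx \sqrt{2}\,\theta_i \approx \sqrt{2}\sin(\theta_i) \le \frac{4 s_1(\Delta)}{\mathrm{gap}(i, \mathcal{X})}, \qquad (2.1)$$

where $\mathrm{gap}(i, \mathcal{X}) = \min_{j \ne i} |s_i(\mathcal{X}) - s_j(\mathcal{X})|$.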
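The perturbation bound of Lemma 2.1 can be checked numerically on a synthetic instance. The sketch below is only an illustration under assumptions: the mean vectors `mu1`, `mu2` are hypothetical values drawn in $[1/2, 1]$, the noise model is a two-point variable taking values $1/2$ or $1$ with the prescribed mean, and the bound used is $4 s_1(\Delta)/\mathrm{gap}(i, \mathcal{X})$ with $s_3(\mathcal{X}) = 0$ for the rank-two static matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
N1, N2, K = 30, 50, 800
mu1 = rng.uniform(0.5, 1.0, K)       # hypothetical population means in [1/2, 1]
mu2 = rng.uniform(0.5, 1.0, K)
Xbar = np.vstack([np.tile(mu1, (N1, 1)), np.tile(mu2, (N2, 1))])  # static matrix

# Random matrix with E[X] = Xbar: each entry is 1/2 or 1 (assumed noise model).
X = 0.5 + 0.5 * (rng.random(Xbar.shape) < 2 * Xbar - 1)

U, s, _ = np.linalg.svd(X, full_matrices=False)
Ub, sb, _ = np.linalg.svd(Xbar, full_matrices=False)
noise = np.linalg.norm(X - Xbar, 2)  # s_1(Delta), the spectral norm of the noise

def aligned_dist(u, ub):
    # Singular vectors are defined up to sign; compare the closer orientation.
    return min(np.linalg.norm(u - ub), np.linalg.norm(u + ub))

gap1 = sb[0] - sb[1]                 # gap(1, Xbar)
gap2 = min(sb[0] - sb[1], sb[1])     # gap(2, Xbar), using s_3(Xbar) = 0

lhs1, rhs1 = aligned_dist(U[:, 0], Ub[:, 0]), 4 * noise / gap1
lhs2, rhs2 = aligned_dist(U[:, 1], Ub[:, 1]), 4 * noise / gap2
```

On instances like this one the left-hand sides are typically well below the bounds, which is why a large gap in the singular values of the static matrix is the quantity the analysis needs to control.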

We first bound the largest singular value $s_1(\Delta) = s_1(X - \mathcal{X})$
of a matrix $(a_{ij})$ with independent zero-mean entries, which defines the
Euclidean operator norm

$$\|(a_{ij})\| := \sup\Big\{ \sum_{i,j} a_{ij} x_i y_j : \sum_i x_i^2 \le 1, \sum_j y_j^2 \le 1 \Big\}. \qquad (2.2)$$

The behavior of the largest singular value of an $n \times m$ random matrix $A$
with i.i.d. entries is well studied. Latala (2005) shows that the weakest
assumption for its regular behavior is boundedness of the fourth moment of the
entries, even if they are not identically distributed. Combining Theorem 2.2 of
Latala (2005) with the concentration Theorem 2.3 by Meckes (2004) proves
Theorem 2.4 that we need.^1

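Theorem 2.2 (Norm of Random Matrices Latala (2005)). For any finite
$n \times m$ matrix $A$ of independent mean zero r.v.'s $a_{ij}$ we have, for an
absolute constant $C$,

$$\mathbf{E}\|(a_{ij})\| \le C \left( \max_i \sqrt{\sum_j \mathbf{E} a_{ij}^2} + \max_j \sqrt{\sum_i \mathbf{E} a_{ij}^2} + \sqrt[4]{\sum_{i,j} \mathbf{E} a_{ij}^4} \right). \qquad (2.3)$$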

Theorem 2.3 (Concentration of Largest Singular Value: Bounded
Range Meckes (2004)). For any finite $n \times m$ matrix $A$, where $n \le m$,
such that the entries $a_{ij}$ are independent r.v.s supported in an interval of
length at most $D$, then, for all $t$,

$$\Pr\big[|s_1(A) - M s_1(A)| \ge t\big] \le 4 e^{-t^2/4D^2}. \qquad (2.4)$$

Theorem 2.4 (Largest Singular Value of a Mean-zero Random Matrix). For
any finite $n \times K$ matrix $A$, where $n \le K$, such that the entries
$a_{ij}$ are independent mean zero r.v.s supported in an interval of length at
most $D$, with fourth moment upper bounded by $B$, then

$$\Pr\left[ s_1(A) \ge C B^{1/4} \sqrt{K} + 4\sqrt{\pi} D + t \right] \le 4 e^{-t^2/4D^2} \qquad (2.5)$$

for all $t$. Hence $\|A\| \le C_1 B^{1/4} \sqrt{K}$ for an absolute constant
$C_1$.



^1 One can also obtain an upper bound of $O(\sqrt{n + K})$ on $s_1(A)$ using a
theorem by Vu (2005), through the construction of an $(n + K) \times (n + K)$
square matrix out of $A$.
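The $O(\sqrt{K})$ scaling of the largest singular value of a mean-zero matrix with bounded entries can be observed empirically. The following sketch is illustrative only: the matrix sizes and the range of bit probabilities are hypothetical, and the entries are centered Bernoulli variables (supported in an interval of length $D = 1$) rather than the normalized variables of Procedure Classify.

```python
import numpy as np

rng = np.random.default_rng(3)
ratios = {}
for n, K in [(50, 400), (100, 1600), (200, 6400)]:
    # Hypothetical per-entry success probabilities; entries need not be i.i.d.
    p = rng.uniform(0.05, 0.95, size=(n, K))
    # Centered Bernoulli entries: independent, mean zero, range of length 1.
    A = (rng.random((n, K)) < p).astype(float) - p
    # Spectral norm s_1(A) relative to sqrt(K).
    ratios[(n, K)] = np.linalg.norm(A, 2) / np.sqrt(K)
```

If the bound holds with a modest constant, the values in `ratios` should stay bounded as $K$ grows with $n \le K$, rather than increase with the matrix size.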
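2.1. Generating a large gap in $s_1(\mathcal{X}), s_2(\mathcal{X})$

In order to apply Lemma 2.1 to the top two singular vectors of $X$ and
$\mathcal{X}$ through

$$\|u_1 - \bar u_1\|_2 \le \frac{4 s_1(X - \mathcal{X})}{|s_1(\mathcal{X}) - s_2(\mathcal{X})|}, \qquad (2.6)$$

$$\|u_2 - \bar u_2\|_2 \le \frac{4 s_1(X - \mathcal{X})}{\min\big(|s_1(\mathcal{X}) - s_2(\mathcal{X})|, |s_2(\mathcal{X})|\big)}, \qquad (2.7)$$

we need to first bound $|s_1(\mathcal{X}) - s_2(\mathcal{X})|$ away from zero,
since otherwise the right-hand sides of both (2.6) and (2.7) become unbounded.
We then analyze

$$\mathrm{gap}(2, \mathcal{X}) = \min\big(|s_1(\mathcal{X}) - s_2(\mathcal{X})|, |s_2(\mathcal{X})|\big).$$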

Let us first define values $a, b, c$ that we use throughout the rest of the
paper:

$$a = \sum_{k=1}^{K} (\mu_1^k)^2, \qquad b = \sum_{k=1}^{K} \mu_1^k \mu_2^k, \qquad c = \sum_{k=1}^{K} (\mu_2^k)^2. \qquad (2.8)$$

For the following analysis, we can assume that $a, b, c \in [K/4, K]$, given
that $X$ is normalized in Procedure Classify.

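We first show that normalization of $X$ as described in Procedure Classify
guarantees that not only $|s_1(\mathcal{X}) - s_2(\mathcal{X})| \ne 0$, but
there also exists a $\Theta(\sqrt{NK})$ amount of gap between
$s_1(\mathcal{X})$ and $s_2(\mathcal{X})$ in Proposition 2.5:

$$\mathrm{gap}(\mathcal{X}) := |s_1(\mathcal{X}) - s_2(\mathcal{X})| = \Theta(\sqrt{NK}). \qquad (2.9)$$

Proposition 2.5. For a normalized random matrix $X$, its expected value matrix
$\mathcal{X}$ satisfies
$\frac{4 c_0}{5}\sqrt{2NK} \le \mathrm{gap}(\mathcal{X}) \le \sqrt{2NK}$, where
$c_0$ = [...] is a constant, given that $a, b, c \in [K/4, K]$ as defined in
(2.8). In addition,

$$[...] \le s_1(\mathcal{X}) \le [...], \quad \text{and} \quad [...] \le s_1(\mathcal{X}) + s_2(\mathcal{X}) \le [...]. \qquad (2.10)$$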

We next state a few important results that justify Procedure Classify. Note
that the left singular vectors $\bar u_i$, $\forall i$, of $\mathcal{X}$ are of
the form $[x_i, \ldots, x_i, y_i, \ldots, y_i]^T$:

$$\bar u_1 = [x_1, \ldots, x_1, y_1, \ldots, y_1]^T, \quad \text{and} \quad \bar u_2 = [x_2, \ldots, x_2, y_2, \ldots, y_2]^T, \qquad (2.11)$$

where $x_i$ repeats $N_1$ times and $y_i$ repeats $N_2$ times. We first show
Proposition 2.6 regarding the signs of $x_i, y_i$, $i = 1, 2$, followed by a
lemma bounding the separation of $x_2, y_2$. We then state the key Separation
Lemma that allows us to conclude that at least one of the top two left singular
vectors of $X$ can be used to classify data at each round. It can be extended
to cases when $k > 2$.

Proposition 2.6. Let $b$ be as defined in (2.8): when $b > 0$, the entries
$x_1, y_1$ in $\bar u_1$ have the same sign while $x_2, y_2$ in $\bar u_2$ have
opposite signs.
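Lemma 2.7. $|x_2 - y_2|^2 \le \frac{C_{\max}}{2N}$, where $C_{\max}$ = [...];
$|x_2|^2 \ge \frac{C_{x\min}}{2N}$, where $C_{x\min}$ = [...]; and
$|y_2|^2 \ge \frac{C_{y\min}}{2N}$, where $C_{y\min}$ = [...].

Lemma 2.8 (Separation Lemma).

$$K\gamma = s_1(\mathcal{X})^2 |x_1 - y_1|^2 + s_2(\mathcal{X})^2 |x_2 - y_2|^2.$$

Proof. Let $\Delta := P_1 - P_2$ as in Theorem 1.2, and
$b = [1, 0, \ldots, 0, -1, 0, \ldots, 0]^T$, where $1$ appears in the first and
$-1$ appears in the $(N_1 + 1)$-th position. Then
$\Delta = \mathcal{X}^T b = [\mu_1^1 - \mu_2^1, \mu_1^2 - \mu_2^2, \ldots, \mu_1^K - \mu_2^K]$.
Given
$\mathcal{X} = s_1(\mathcal{X}) \bar u_1 \bar v_1^T + s_2(\mathcal{X}) \bar u_2 \bar v_2^T$,
we thus rewrite $\Delta$ as:
$\Delta = \mathcal{X}^T b = s_1(\mathcal{X}) \bar v_1 \bar u_1^T b + s_2(\mathcal{X}) \bar v_2 \bar u_2^T b = s_1(\mathcal{X}) \bar v_1 (x_1 - y_1) + s_2(\mathcal{X}) \bar v_2 (x_2 - y_2)$.
The lemma follows from the fact that $\|\Delta\|_2 = \sqrt{K\gamma}$ and
$\bar v_1, \bar v_2$ are orthonormal. □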
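The structure used in this argument can be verified numerically: the static matrix has rank at most two, its two nonzero singular values depend only on $a, b, c$ and the group sizes, its left singular vectors are constant on each population, and the squared distance between the two mean rows decomposes along the two singular directions. The sketch below assumes hypothetical mean vectors `mu1`, `mu2` in $[1/2, 1]$, and it works directly with $\|\mu_1 - \mu_2\|_2^2$ rather than with $K\gamma$. The $2 \times 2$ matrix `G` is a standard reduction introduced here for illustration and is not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
N1, N2, K = 30, 50, 500
mu1 = rng.uniform(0.5, 1.0, K)      # hypothetical mean vectors in [1/2, 1]
mu2 = rng.uniform(0.5, 1.0, K)
Xbar = np.vstack([np.tile(mu1, (N1, 1)), np.tile(mu2, (N2, 1))])

a, b, c = mu1 @ mu1, mu1 @ mu2, mu2 @ mu2          # as in (2.8)

# Nonzero eigenvalues of Xbar Xbar^T equal those of this 2 x 2 matrix.
G = np.array([[N1 * a, np.sqrt(N1 * N2) * b],
              [np.sqrt(N1 * N2) * b, N2 * c]])
s_from_abc = np.sqrt(np.linalg.eigvalsh(G)[::-1])  # (s_1, s_2) in descending order

U, s, _ = np.linalg.svd(Xbar, full_matrices=False)
x1, y1 = U[0, 0], U[N1, 0]          # repeated entries of the first left singular vector
x2, y2 = U[0, 1], U[N1, 1]          # repeated entries of the second left singular vector

lhs = np.sum((mu1 - mu2) ** 2)      # squared norm of the difference of the mean rows
rhs = s[0] ** 2 * (x1 - y1) ** 2 + s[1] ** 2 * (x2 - y2) ** 2
```

The sign pattern of `x1, y1` and `x2, y2` also illustrates Proposition 2.6: with $b > 0$ the first pair shares a sign and the second pair does not.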



Combining Proposition 2.6, Lemma 2.7, (2.10), and Lemma 2.8, we have

Corollary 2.9. $s_2(\mathcal{X}) \le \sqrt{2NK\gamma}$ and hence
$\mathrm{gap}(2, \mathcal{X}) = \min\big(s_2(\mathcal{X}), |s_1(\mathcal{X}) - s_2(\mathcal{X})|\big) = s_2(\mathcal{X})$
for a sufficiently small $\gamma$.

In Section 2.2, we first prove a proposition regarding $a, b, c$ as defined in
(2.8). We next provide the proof for Theorem 2.4 regarding the largest singular
value of $(X - \mathcal{X})$. In Section 2.4, we show that the probability of
error at each round for each individual is at most $f = 1/10$, given the sample
size $n$ as specified in Theorem 1.2. Hence by taking a majority vote over the
different runs for each sample, our algorithm will find the correct partition
with probability $1 - 1/n^2$, given that at each round we take a set of
$K > n$ independent features.



2.2. Detailed analysis for the simple algorithm 

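Throughout the rest of the paper, we use $X, Y, H$ to represent random
matrices, where $H = XX^T$ and $Y$ = [...]. We use
$\mathcal{X}, \mathcal{Y}, \mathcal{H}$ to represent the corresponding static
matrices. Let us substitute $a, b, c$ in $\mathcal{H} = \mathcal{X}\mathcal{X}^T$,
where the blocks in $\mathcal{H}$ from top to bottom and from left to right are
of size $N_1 \times N_1$, $N_1 \times N_2$, $N_2 \times N_1$ and
$N_2 \times N_2$ respectively (here $\mathbf{1}_{p \times q}$ denotes the
$p \times q$ all-ones block):

$$\mathcal{H} = \mathcal{X}\mathcal{X}^T = \begin{bmatrix} a\,\mathbf{1}_{N_1 \times N_1} & b\,\mathbf{1}_{N_1 \times N_2} \\ b\,\mathbf{1}_{N_2 \times N_1} & c\,\mathbf{1}_{N_2 \times N_2} \end{bmatrix}_{2N \times 2N} \qquad (2.12)$$

Proposition 2.10. For any choices of $\mu_i^k$, $ac \ge b^2$. By definition,

$$a + c - 2b = \sum_{k=1}^{K} \alpha_k^2, \quad \text{where } \alpha_k = |\mu_1^k - \mu_2^k|. \qquad (2.13)$$

Proof. $a + c - 2b = \sum_k \alpha_k^2$ holds by definition. For the first
claim,

$$\begin{aligned}
ac - b^2 &= \sum_{k=1}^{K} (\mu_1^k)^2 \sum_{k=1}^{K} (\mu_2^k)^2 - \Big( \sum_{k=1}^{K} \mu_1^k \mu_2^k \Big)^2 \\
&= \sum_{k=1}^{K} \sum_{j \ne k} (\mu_1^k)^2 (\mu_2^j)^2 - \sum_{k=1}^{K} \sum_{j \ne k} \mu_1^k \mu_2^k \mu_1^j \mu_2^j \\
&= \sum_{k < j} \Big[ (\mu_1^k \mu_2^j)^2 + (\mu_1^j \mu_2^k)^2 - 2 \mu_1^k \mu_2^k \mu_1^j \mu_2^j \Big] = \sum_{k < j} (\mu_1^k \mu_2^j - \mu_1^j \mu_2^k)^2 \ge 0.
\end{aligned}$$

□

Remark 2.11. Both matrices $\mathcal{X}$ and $\mathcal{X}\mathcal{X}^T$ have
rank at most two. When $ac = b^2$, $\mathcal{H}$ has rank 1.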

2.3. Proof of Theorem 2.4

By having an upper bound on both the maximum variance and the fourth moment of
any entry, we have the following corollary of Theorem 2.2.

Corollary 2.12 (Largest Singular Value: Bounded Fourth Moment Latala
(2005)). For any finite $n \times m$ matrix, where $n \le m$, of independent
mean zero r.v.'s $a_{ij}$, such that the maximum variance of any entry is at
most $\sigma^2$, and each entry has a finite fourth moment $B$, we have

$$\mathbf{E}\|(a_{ij})\| \le C \left( \sigma\sqrt{m} + \sigma\sqrt{n} + (nmB)^{1/4} \right)$$

for an absolute constant $C$.

Remark 2.13. The requirement that $\sigma^2$ is upper bounded is not essential.
The conclusion in Corollary 2.12 works so long as the fourth moment is bounded
by $B$.

Let $M s_1(A)$ be the median of $s_1(A)$. Following a calculation from Meckes
(2004), we have

$$|\mathbf{E}[s_1(A)] - M s_1(A)| \le \mathbf{E}\big[|s_1(A) - M s_1(A)|\big] = \int_0^\infty \Pr\big[|s_1(A) - M s_1(A)| \ge t\big]\,dt \le \int_0^\infty 4 e^{-t^2/4D^2}\,dt = 4\sqrt{\pi} D, \qquad (2.14)$$

where $D \le$ [...] for the Bernoulli random variables that we consider. This
allows us to conclude Theorem 2.4. □
2.4. Correctness of classification for the simple algorithm

We now prove correctness of our algorithm. We first show how to choose $T$ for
Procedure Classify. Let $B$ denote the fourth moment bound for a single random
variable in the mean zero random matrix $X - \mathcal{X}$; for the type of
normalized Bernoulli r.v.s that we care about, $\sqrt{B}$ is in the order of
$\sigma^2$, where $\sigma^2$ is defined in Theorem 1.2.

Let [...] be a large enough constant. Let
$s_1(X - \mathcal{X}) \le C_0\sqrt{K}$, where $C_0 = C_1 B^{1/4}$ as defined in
Theorem 2.4, and let the threshold

$$T = \sqrt{C_3 K N \gamma} \ge 15 C_0 \sqrt{K}, \qquad (2.15)$$

which requires that

$$C_3 N \gamma \ge 225 C_0^2, \quad \text{where } C_3 \text{ satisfies (2.19)}. \qquad (2.16)$$

Following Lemma A.4, (A.1), (A.3), and Proposition A.1, we have

$$|s_2(X) - s_2(\mathcal{X})| \le s_1(X - \mathcal{X}) \le C_0\sqrt{K}. \qquad (2.17)$$

We have two cases,

1. When $s_2(X) \le T$, by Lemma 2.7 and the fact that
$s_2(\mathcal{X}) \le s_2(X) + s_1(X - \mathcal{X}) \le T + C_0\sqrt{K} \le \frac{16T}{15}$,
we have

$$s_2(\mathcal{X})^2 |x_2 - y_2|^2 \le \frac{256 T^2}{225} \cdot \frac{C_{\max}}{2N} = \frac{128\, C_3 K \gamma\, C_{\max}}{225} \qquad (2.18)$$

for $C_{\max}$ as defined in Lemma 2.7. We want
$s_2(\mathcal{X})^2 |x_2 - y_2|^2 \le$ [...]. This holds so long as [...], which
is true if

$$C_3 \le [...]; \text{ we set } C_3 = [...] \text{ from this point on}. \qquad (2.19)$$

It follows from Lemma 2.8 that $s_1(\mathcal{X})^2 |x_1 - y_1|^2 \ge$ [...].
Hence by (2.10)

$$|x_1 - y_1| \ge [...]. \qquad (2.20)$$

Thus the condition of Theorem 2.14 holds with $C_2$ = [...], so long as

$$N\gamma \ge [...], \qquad (2.21)$$

due to (2.16) and (2.19). This is a weaker condition than (2.32) for
$f <$ [...].

2. When $s_2(X) > T$, we have
$s_2(\mathcal{X}) \ge s_2(X) - s_1(X - \mathcal{X}) \ge T - C_0\sqrt{K} \ge \frac{14T}{15}$.
This satisfies the condition of Theorem 2.18, with $c_3$ = [...].






Let us first denote the first singular vector $u_1$ and its "noise" vector
$\epsilon$ as follows:

$$u_1 = (x + \delta_1, \ldots, x + \delta_{N_1}, y + \tau_1, \ldots, y + \tau_{N_2}), \qquad \epsilon = (\delta_1, \ldots, \delta_{N_1}, \tau_1, \ldots, \tau_{N_2}).$$

It turns out that we only need to use the mixture mean

$$M = \frac{1}{2N} \sum_{i=1}^{2N} u_{1,i} \qquad (2.22)$$

to decide on which side to put a node, i.e., to partition $j \in [2N]$
according to $u_{1,j} < M$ or $u_{1,j} \ge M$, given that $N_1/N_2$ is a
constant; misclassifying any entry will contribute an $\Omega(\gamma/N)$ amount
to $\|u_1 - \bar u_1\|_2^2$.

Theorem 2.14. Assume w.l.o.g. that $N_1 \le N_2$ and $2N \le K$. Let
$\omega_1 = N_1/2N$ and $\omega_2 = N_2/2N$. Suppose
$|x_1 - y_1| \ge C_2\sqrt{[...]}$ for some constant $C_2$ = [...]. By requiring
$N \ge$ [...] as in (2.21), and

$$N_1 \ge [...], \quad \text{or equivalently} \quad 2N \ge [...], \qquad (2.23)$$

where $c_1$ = [...] for $C_1$ specified in Theorem 2.4 and $c_0$ specified in
Proposition 2.5, we can classify the two populations using the mixture mean $M$
with the error factor at most $f$ for $N_1, N_2$ respectively whp.



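By Lemma 2.1 and Theorem 2.4, we immediately have the following claim.

Claim 2.15. For $c_1$ chosen as in Theorem 2.14,
$\|\epsilon\|_2^2 = \sum_{i=1}^{N_1} \delta_i^2 + \sum_{i=1}^{N_2} \tau_i^2 \le \frac{c_1^2 \sigma^2}{N}$.

Proof. Given that $c_1$ = [...] such that $C_1$ appears in Theorem 2.4 and
$c_0$ appears in Proposition 2.5,

$$\|\epsilon\|_2 = \sqrt{\sum_{i=1}^{N_1} \delta_i^2 + \sum_{i=1}^{N_2} \tau_i^2} = \|u_1 - \bar u_1\|_2 \le \|[u_1, v_1] - [\bar u_1, \bar v_1]\|_2 \approx \sqrt{2}\,\theta_1 \approx \sqrt{2}\sin(\theta_1) \le \frac{4 s_1(X - \mathcal{X})}{\mathrm{gap}(1, \mathcal{X})} \le \frac{4 C_1 B^{1/4} \sqrt{K}}{4 c_0 \sqrt{2NK}/5} = \frac{c_1 \sigma}{\sqrt{N}}.$$

This allows us to conclude the claim. □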

We need the following lemma, proof of which appears in Appendix B. 

Lemma 2.16. Assume that $2N \le K$ and that Condition (2.23) in Theorem 2.14
holds; then we have

$$M - x \ge [...], \qquad y - M \ge [...]. \qquad (2.24)$$
Proof of Theorem 2.14. Recall that the top left singular vectors
$\bar u_1, \bar u_2$ have the form $[x, \ldots, x, y, \ldots, y]$, where $x$
repeats $N_1$ times and $y$ repeats $N_2$ times; hence, assuming w.l.o.g. that
$x < y$, we have

$$\forall i \text{ s.t. } x + \delta_i > M, \text{ it contributes } \delta_i^2 \ge |M - x|^2 \ge [...] \text{ to } \|\epsilon\|_2^2, \qquad (2.25)$$

$$\forall i \text{ s.t. } y + \tau_i < M, \text{ it contributes } \tau_i^2 \ge |M - y|^2 \ge [...] \text{ to } \|\epsilon\|_2^2. \qquad (2.26)$$

Hence the total number of entries that go above $M$ from $P_1$, and those that
go below $M$ from $P_2$, cannot be too many since their total contribution is
upper bounded by $\|\epsilon\|_2^2 = \|u_1 - \bar u_1\|_2^2$. Let $\ell_1$ be
the number of misclassified entries from $N_1$, i.e., those described in
(2.25); by Lemma 2.1,

$$\ell_1\,[...] \le \ell_1 |M - x|^2 \le \|\epsilon\|_2^2 \le \frac{c_1^2 \sigma^2}{N}. \qquad (2.27)$$

Thus, given that $N_1 \ge$ [...], it suffices to guarantee that
$\ell_1 \le$ [...] $\le f N_1$.

We next bound the number of entries from $P_2$ that go below $M$, which cannot
be too many either; let $\ell_2$ be the number of misclassified entries from
$P_2$,

$$\ell_2\,[...] \le \ell_2 |M - y|^2 \le \|\epsilon\|_2^2 \le \frac{c_1^2 \sigma^2}{N}, \qquad (2.28)$$

hence by requiring

$$N_2 \ge [...], \qquad (2.29)$$

it suffices to guarantee that $\ell_2 \le$ [...] $\le f N_2$. Condition (2.29)
is equivalent to

$$N_1 = \frac{N_2 \omega_1}{\omega_2} \ge [...]. \qquad (2.30)$$

Thus by requiring

$$N_1 \ge [...], \qquad (2.31)$$

we have satisfied all requirements. □
Combining Lemma 2.1 and Corollary 2.9, we have 

Claim 2.17. Given that S2{X) > csy/KN^, \\u2 ~ mWl < < 
This allows us to prove the following theorem. Let the classification error
factor be the number of misclassified individuals from one group divided by the
total number of individuals in that group.

Theorem 2.18. Assume Ni < N2 and 2N < K. Let uji = Ni/2N and UI2 = 
N2/2N . Let S2{X) > csy^KN^, where C3 — j^y^^^ o-nd oj-min is the minimum 
possible weight allowed by the algorithm. By requiring 



/7 V^l / 

we can classify the two population using to separate components in U2, with 
error factor at most f for both Pi, P2 whp. 



Proof. Let £1 , £2 be the number of misclassified entries from Pi and P2 respec- 
tively; they each contribute at least , and '^^j^'" amount to \\u2 — 'S2II2J 



and hence by Claim 2.17, 

< \\u.-U2\\l<^^^<^. (2.33) 
27V - II - ^112 - c^i^AT^ - clN-f ^ ' 

Hence h < < fNi given that TV > ig(4^ + l). 

Similarly, by Claim 2.17, we have £2 '^2jv'" — il'"2~M2|l2 and thus £2 < 
< ^ so long as iV > ^(4^ + l); the bound on 2N follows by 

'^3T'-'y mill J ^3J'^ ^1 

plugging in C3 = □ 
Finally, 

Theorem 2.19. Given a set of n > 17 ( ^,J^ ) individuals, by trying Proce- 
dure Classify for log n rounds, with probability of error at each round for each 
individual being f = 1/10, where each round we take a set of K > n independent 
features, and by taking majority vote over the different runs for each sample, 
our algorithm will find the correct partition with probability 1 — 1 /n^ . 

Proof. A sample is put in the wrong side with a probability 1/10 at each round. 
Let £i be the event that sample i is misclassified for more than logn times, thus 
Pr[fi] = ( jly) < 1/n'^ '^^; hence by union bound, with probability 1 — l/n^, 
none of the 2N individuals is misclassified. □ 
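The amplification step behind Theorem 2.19 can be illustrated numerically. The sketch below is only an illustration, not the paper's calculation: it assumes that the rounds are independent (as they are when each round uses a fresh set of features) and computes the exact binomial probability that a fixed sample is misclassified in more than half of the rounds. The function name `majority_error` and the parameter `rounds` are hypothetical.

```python
from math import comb


def majority_error(rounds: int, f: float = 0.1) -> float:
    """Probability that a sample is wrong in more than half of `rounds`
    independent rounds, each of which errs with probability f."""
    threshold = rounds // 2 + 1  # strictly more than half of the rounds
    return sum(
        comb(rounds, t) * f**t * (1 - f) ** (rounds - t)
        for t in range(threshold, rounds + 1)
    )


if __name__ == "__main__":
    # The failure probability of the majority vote decays quickly with the
    # number of rounds, which is what makes the union bound over samples work.
    for r in (1, 5, 15, 31):
        print(r, majority_error(r))
```

With $f = 1/10$ the per-sample failure probability drops geometrically in the number of rounds, so on the order of $\log n$ rounds suffice to make it polynomially small in $n$.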



3. The algorithm Partition 
3.1. Preliminaries

Let $V = \{1, \ldots, n\}$ be the set of all $n$ individuals, and let $\psi : V \to \{1, \ldots, k\}$
be the map that assigns to each individual the population it belongs to. Set
$V_t = \psi^{-1}(t)$ and $N_t = |V_t|$. Moreover, let $E = (E_{vi})_{1 \le v \le n,\, 1 \le i \le K}$ be the $n \times K$



A. Blum et al. /Separating populations with wide data: A spectral analysis 






matrix with entries $E_{vi} = p_i^{\psi(v)}$. For any $1 \le t \le k$ we let $E^{V_t} = (p_i^t)_{i=1,\ldots,K}$ be
the row of $E$ corresponding to any $v \in V_t$. In addition, let $A = (a_{vi})$ denote the
empirical $n \times K$ input matrix. Thus, the entries of $E$ equal the expectations of
the entries of $A$.

As in Theorem 1.3, we let

$$\gamma = K^{-1} \min_{1 \le i < j \le k} \|E^{V_i} - E^{V_j}\|^2, \qquad \Gamma = K\gamma.$$

Further, set $\lambda = \sqrt{K}\sigma$. Then the assumption from Theorem 1.3 can be rephrased
as

$$n_{\min}\Gamma \ge C_k\lambda^2 \quad\text{and}\quad \Gamma \ge K^{-1}, \qquad (3.1)$$

where $C_k$ signifies a sufficiently large number that depends on $k$ only (the precise
value of $C_k$ will be specified implicitly in the course of the analysis). As in
the previous section, by repeating the partitioning process $\log n$ times, we may
restrict our attention to the problem of classifying a constant fraction of the
individuals correctly. That is, it is sufficient to establish the following claim.

Claim 3.1. There is a polynomial time algorithm Partition that satisfies the
following. Suppose that (3.1) is true. Then whp Partition$(A, k)$ outputs a partition
$(S_1, \ldots, S_k)$ of $V$ such that there exists a permutation $\sigma$ such that

$$\sum_{i=1}^{k} |V_i \,\triangle\, S_{\sigma(i)}| < 0.001\, n_{\min}.$$



Let $X = (x_{ij})_{1 \le i \le n,\, 1 \le j \le K}$ be an $n \times K$ matrix. By $X_i$ we denote the $i$'th row
$(x_{i1}, \ldots, x_{iK})$ of $X$. Moreover, we let

$$\|X\| = \max_{\xi \in \mathbb{R}^K:\ \|\xi\| = 1} \|X\xi\|$$

signify the operator norm of $X$. A rank $k$ approximation of $X$ is a matrix $\hat X$ of
rank at most $k$ such that for any $n \times K$ matrix $Y$ of rank at most $k$ we have
$\|X - \hat X\| \le \|X - Y\|$. Given $X$, a rank $k$ approximation $\hat X$ can be computed as
follows. Letting $\rho = \mathrm{rank}(X)$, we compute the singular value decomposition

$$X = \sum_{i=1}^{\rho} \lambda_i \xi_i \eta_i^T;$$

here $(\xi_i)_{1 \le i \le \rho}$ is an orthonormal family in $\mathbb{R}^n$, $(\eta_i)_{1 \le i \le \rho}$ is an orthonormal
family in $\mathbb{R}^K$, and we assume that the singular values are in decreasing order
(i.e., $\lambda_1 \ge \cdots \ge \lambda_\rho$). This can be accomplished in polynomial time within any
numerical precision. Then $\hat X = \sum_{i=1}^{\min\{k,\rho\}} \lambda_i \xi_i \eta_i^T$ is easily verified to be a rank
$k$ approximation.
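The truncated singular value decomposition just described can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation; the function name `rank_k_approximation` is hypothetical, and floating-point SVD stands in for "any numerical precision".

```python
import numpy as np


def rank_k_approximation(X: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest singular triples of X (fewer if X has fewer)."""
    # full_matrices=False returns the thin SVD: X = U @ diag(s) @ Vt,
    # with the singular values s sorted in decreasing order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = min(k, len(s))
    return (U[:, :r] * s[:r]) @ Vt[:r, :]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((6, 10))
    X_hat = rank_k_approximation(X, 2)
    # The operator-norm error equals the third singular value of X.
    print(np.linalg.norm(X - X_hat, 2), np.linalg.svd(X, compute_uv=False)[2])
```

The operator-norm error of the truncation equals the $(k+1)$-st singular value, which is the smallest error any matrix of rank at most $k$ can achieve.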

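In addition to the operator norm, we are going to work with the Frobenius
norm

$$\|X\|_F = \Big(\sum_{i=1}^{n} \sum_{j=1}^{K} x_{ij}^2\Big)^{1/2}.$$

Although the following fact is well known, we provide its proof for completeness.

Lemma 3.2. If $X$ has rank $k$, then $\|X\|_F^2 \le k\|X\|^2$.

Proof. Let $X = \sum_{i=1}^{k} \lambda_i \xi_i \eta_i^T$ be a singular value decomposition as above. Then

$$\|X\|_F^2 = \sum_{i,j=1}^{k} \lambda_i \lambda_j \langle \xi_i, \xi_j \rangle \langle \eta_i, \eta_j \rangle.$$

Since $\xi_1, \ldots, \xi_k$ and $\eta_1, \ldots, \eta_k$ are orthonormal families, we have $\langle \xi_i, \xi_j \rangle = \langle \eta_i, \eta_j \rangle = 1$
if $i = j$ and $\langle \xi_i, \xi_j \rangle = \langle \eta_i, \eta_j \rangle = 0$ if $i \ne j$. Hence, $\|X\|_F^2 = \sum_{i=1}^{k} \lambda_i^2$.
This implies the assertion, because $\lambda_j \le \lambda_1 = \|X\|$ for all $1 \le j \le k$. □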
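Lemma 3.2 is easy to check numerically. The following sketch (the helper name `frobenius_and_bound` is hypothetical) returns the squared Frobenius norm of a matrix together with the bound rank times squared operator norm, using the numerical rank reported by NumPy.

```python
import numpy as np


def frobenius_and_bound(X: np.ndarray) -> tuple[float, float]:
    """Return (||X||_F^2, rank(X) * ||X||^2) for a real matrix X."""
    fro_sq = np.linalg.norm(X, "fro") ** 2
    op_sq = np.linalg.norm(X, 2) ** 2  # largest singular value, squared
    rank = np.linalg.matrix_rank(X)
    return float(fro_sq), float(rank * op_sq)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # A product of an 8x3 and a 3x20 Gaussian matrix has rank 3 almost surely.
    X = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 20))
    print(frobenius_and_bound(X))
```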

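3.2. Description of the algorithm

Algorithm 3.3. Partition$(A, k)$

Input: An $n \times K$ matrix $A$ and the parameter $k$. Output: A partition $S_1, \ldots, S_k$
of $V$.

1. Compute a rank $k$ approximation $\hat A$ of $A$.

For $j = 1, \ldots, 2\log K$ do

2. Let $\Gamma_j = K 2^{-j}$ and compute $Q^{(j)}(v) = \{w \in V : \|\hat A_w - \hat A_v\|^2 \le 0.01\Gamma_j\}$ for
all $v \in V$.

Then, determine sets $Q_1^{(j)}, \ldots, Q_k^{(j)}$ as follows: for $i = 1, \ldots, k$ do

3. Pick $v \in V \setminus \bigcup_{l=1}^{i-1} Q_l^{(j)}$ such that $|Q^{(j)}(v) \setminus \bigcup_{l=1}^{i-1} Q_l^{(j)}|$ is maximum.
Set $Q_i^{(j)} = Q^{(j)}(v) \setminus \bigcup_{l=1}^{i-1} Q_l^{(j)}$ and $\xi_i^{(j)} = |Q_i^{(j)}|^{-1} \sum_{w \in Q_i^{(j)}} \hat A_w$.

4. Partition the entire set $V$ as follows: first, let $S_i^{(j)} = Q_i^{(j)}$ for all $1 \le i \le k$.
Then, add each $v \in V \setminus \bigcup_{l=1}^{k} Q_l^{(j)}$ to a set $S_i^{(j)}$ such that $\|\hat A_v - \xi_i^{(j)}\|$ is
minimum. Set $r_j = \sum_{i=1}^{k} \sum_{v \in S_i^{(j)}} \|\hat A_v - \xi_i^{(j)}\|^2$.

5. Let $J$ be such that $r^* = r_J$ is minimum. Return $S_1^{(J)}, \ldots, S_k^{(J)}$.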
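The steps of Partition can be sketched in NumPy as follows. This is a rough illustration under stated assumptions, not the authors' code: the error parameter $r_j$ is taken to be the total squared distance of the rows of the rank $k$ approximation to their assigned centers (the quantity bounded in Lemma 3.6), the logarithm is taken to base 2, and a candidate value that produces fewer than $k$ non-empty sets is simply skipped. The function name `partition` and all variable names are hypothetical.

```python
import numpy as np


def partition(A: np.ndarray, k: int):
    """Sketch of Partition(A, k); returns labels in {0, ..., k-1} or None."""
    n, K = A.shape
    # Step 1: rank-k approximation via truncated SVD.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_hat = (U[:, :k] * s[:k]) @ Vt[:k, :]
    # Squared distances between all pairs of rows of A_hat.
    sq = np.sum(A_hat ** 2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * A_hat @ A_hat.T, 0.0)

    best_r, best_labels = np.inf, None
    for j in range(1, 2 * int(np.ceil(np.log2(K))) + 1):
        gamma_j = K * 2.0 ** (-j)
        near = D2 <= 0.01 * gamma_j        # Step 2: near[v, w] iff w in Q(v)
        free = np.ones(n, dtype=bool)      # rows not yet covered by a Q_i
        labels = np.full(n, -1)
        centers = []
        for i in range(k):                 # Step 3: greedy choice of Q_i
            if not free.any():
                break
            counts = (near & free[None, :]).sum(axis=1)
            counts[~free] = -1             # v itself must be uncovered
            v = int(np.argmax(counts))
            members = near[v] & free
            labels[members] = i
            free &= ~members
            centers.append(A_hat[members].mean(axis=0))
        if len(centers) < k:
            continue                       # assumption: skip this candidate
        centers = np.array(centers)
        # Step 4: attach uncovered rows to the nearest center, compute r_j.
        d2c = ((A_hat[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        rest = labels < 0
        labels[rest] = np.argmin(d2c[rest], axis=1)
        r_j = d2c[np.arange(n), labels].sum()
        if r_j < best_r:                   # Step 5: keep the smallest r_j
            best_r, best_labels = r_j, labels.copy()
    return best_labels
```

The constants 0.01 and the candidate grid $K2^{-j}$ are taken from the listing above; they are tuned for the analysis, so on small or noisy inputs the sketch may return a poor partition even when a simpler clustering would succeed.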

The basic idea behind Partition is to classify each individual $v \in V$ according
to its row vector $\hat A_v$ in the rank $k$ approximation $\hat A$. That is, two individuals
$v, w$ are deemed to belong to the same population iff $\|\hat A_v - \hat A_w\|^2 \le 0.01\Gamma$.
Hence, Partition tries to determine sets $S_1, \ldots, S_k$ such that for any two $v, w$
in the same set $S_j$ the distance $\|\hat A_v - \hat A_w\|$ is small. To justify this approach,
we show that $\hat A$ is "close" to the expectation $E$ of $A$ in the following sense.

Lemma 3.4. There is a constant $C > 0$ such that $\sum_{v \in V} \|\hat A_v - E_v\|^2 \le Ck\lambda^2$
whp.

Proof. Since both $\hat A$ and $E$ have rank at most $k$, and as therefore $\hat A - E$ has
rank at most $2k$, Lemma 3.2 yields

$$\sum_{v \in V} \|\hat A_v - E_v\|^2 = \|\hat A - E\|_F^2 \le 2k\|\hat A - E\|^2.$$

Furthermore, $\|\hat A - E\| \le \|\hat A - A\| + \|A - E\| \le 2\|E - A\|$, because $\hat A$ is a rank $k$
approximation of $A$. As Theorem 2.4 implies that $\|A - E\|^2 = O(\lambda^2)$ whp, we thus obtain
$\sum_{v \in V} \|\hat A_v - E_v\|^2 \le 8k\|A - E\|^2 \le Ck\lambda^2$ for a certain constant $C > 0$. □

Observe that Lemma 3.4 imphes that for most v we have Et,|p < lO^^F, 
say. For letting z = \{v: \\A,, - E,„||2 > lO-^F}], we get 

lO-^Fz < \\Ay - E„||2 < CkX^, 

whence $z \ll n_{\min}$ due to our assumption that $n_{\min}\Gamma \gg k\lambda^2$. Thus, most rows of
$\hat A$ are close to the corresponding rows of the expected matrix $E$. Therefore, the
separation assumption from Theorem 1.3 implies that for most pairs
of elements in different classes $v \in V_i$, $w \in V_j$ the squared distance $\|\hat A_v - \hat A_w\|^2$
will be large (at least $0.99\Gamma$, say). By contrast, for most pairs $v, w \in V_i$ of
elements belonging to the same class $\|\hat A_v - \hat A_w\|^2$ will be small (at most $0.01\Gamma$,
say), because $E_v = E_w$.

As the above discussion indicates, if the algorithm were given $\Gamma$ as an input
parameter, the procedure described in Steps 2-4 (with $\Gamma_j$ replaced by $\Gamma$) would
yield the desired partition of $V$. The procedure described in Steps 2-4 is very
similar to the spectral partitioning algorithm from McSherry (2001).

However, since $\Gamma$ is not given to the algorithm as an input parameter, Partition
has to estimate $\Gamma$ on its own. (This is the new aspect here in comparison to
McSherry (2001), and this fact necessitates a significantly more involved analysis.)
To this end, the outer loop goes through $2\log K$ "candidate values" $\Gamma_j$.
These values are then used to obtain partitions $Q_1^{(j)}, \ldots, Q_k^{(j)}$ in Steps 2-4. More
precisely, Step 2 uses $\Gamma_j$ to compute for each $v \in V$ the set $Q(v)$ of elements
$w$ such that $\|\hat A_w - \hat A_v\|^2 \le 0.01\Gamma_j$. Then, Step 3 tries to compute "big" disjoint
$Q_1^{(j)}, \ldots, Q_k^{(j)}$, where each $Q_i^{(j)}$ results from some $Q(v_i)$. Further, Step 4 assigns
all elements $v$ not covered by $Q_1^{(j)}, \ldots, Q_k^{(j)}$ to that $Q_i^{(j)}$ whose "center vector"
$\xi_i^{(j)}$ is closest to $\hat A_v$. In addition, Step 4 computes an "error parameter" $r_j$.
Finally, Step 5 outputs the partition that minimizes the error parameter $r_j$.

Thus, we need to show that eventually picking the partition whose error term
$r_j$ is minimum yields a good approximation to the ideal partition $V_1, \ldots, V_k$.
The basic reason why this is true is that the "empirical" mean $\xi_i^{(j)}$ should
approximate the expectation $E^{V_i}$ for class $V_i$ well iff $Q_i^{(j)}$ is a good approximation
of $V_i$. Hence, if $Q_1^{(j)}, \ldots, Q_k^{(j)}$ is "close" to $V_1, \ldots, V_k$, then

$$r_j = \sum_{i=1}^{k} \sum_{v \in S_i^{(j)}} \|\hat A_v - \xi_i^{(j)}\|^2$$

will be about as small as $\|\hat A - E\|_F^2$ (cf. Lemma 3.4). In fact, the following lemma
shows that if $\Gamma_j$ is "close" to the ideal $\Gamma$, then $r_j$ will be small.
Lemma 3.5. If $\frac{1}{2}\Gamma \le \Gamma_j \le \Gamma$, then $r_j \le C_0 k^3 \lambda^2$ for a certain constant $C_0 > 0$.

We defer the proof of Lemma 3.5 to Section 3.3. Furthermore, the next lemma
shows that any partition such that $r_j$ is small yields a good approximation to
$V_1, \ldots, V_k$.

Lemma 3.6. Let $S_1, \ldots, S_k$ be a partition and $\xi_1, \ldots, \xi_k$ a sequence of vectors
such that $\sum_{i=1}^{k} \sum_{v \in S_i} \|\xi_i - \hat A_v\|^2 \le C_0 k^3 \lambda^2$. Then there is a bijection
$\sigma : \{1, \ldots, k\} \to \{1, \ldots, k\}$ such that the following holds.

1. $\|\xi_i - E^{V_{\sigma(i)}}\|^2 \le 0.001\Gamma$ for all $i = 1, \ldots, k$, and
2. $\sum_{i=1}^{k} |S_i \,\triangle\, V_{\sigma(i)}| \le 0.001\, n_{\min}$.

The proof of Lemma 3.6 can be found in Section 3.4.

Proof of Claim 3.1. Since the rank $k$ approximation $\hat A$ can be computed in polynomial
time (within any numerical precision), Partition is a polynomial time
algorithm. Hence, we just need to show that $K^{-1} \le \Gamma \le K$; for then Partition
will eventually try a $\Gamma_j$ such that $\frac{1}{2}\Gamma \le \Gamma_j \le \Gamma$, so that the claim follows from
Lemmas 3.5 and 3.6. To see that $K^{-1} \le \Gamma \le K$, recall that we explicitly assume
that $\Gamma \ge K^{-1}$. Furthermore, all entries of the vectors $E^{V_i}$ lie between 0 and 1,
whence $\Gamma \le \max_{i<j} \|E^{V_i} - E^{V_j}\|^2 \le K$. □



3.3. Proof of Lemma 3.5

Suppose that $\frac{1}{2}\Gamma \le \Gamma_j \le \Gamma$. To ease up the notation, we omit the superscript
$j$; thus, we let $S_i = S_i^{(j)}$, $Q_i = Q_i^{(j)}$, $\xi_i = \xi_i^{(j)}$ for $1 \le i \le k$, and $Q(v) = Q^{(j)}(v)$ for
$v \in V$ (cf. Steps 2-4 of Partition). The following lemma shows that there is a
permutation $\pi$ such that $\xi_i$ is "close" to $E^{V_{\pi(i)}}$ for all $1 \le i \le k$, and that the
sets $Q_i$ are "not too small".

Lemma 3.7. Suppose that $\frac{1}{2}\Gamma \le \Gamma_j \le \Gamma$. There is a bijection $\pi : \{1, \ldots, k\} \to
\{1, \ldots, k\}$ such that for each $1 \le i \le k$ we have $|Q_i| \ge \frac{1}{2}|V_{\pi(i)}|$ and
$\|\xi_i - E^{V_{\pi(i)}}\|^2 \le 0.1\Gamma$.

Proof. For $1 \le i \le k$ we choose $\pi(i)$ so that $|Q_i \cap V_{\pi(i)}|$ is maximum. We shall
prove below that for all $1 \le l \le k$ we have

$$\|\xi_l - E^{V_{\pi(l)}}\|^2 \le 0.1\Gamma, \qquad (3.2)$$
$$|Q_l| \ge \max\{|V_i| : i \in \{1, \ldots, k\} \setminus \pi(\{1, \ldots, l-1\})\} - 0.01\, n_{\min}, \qquad (3.3)$$
$$|Q_l \cap V_{\pi(l)}| \ge |Q_l| - 0.01\, n_{\min}. \qquad (3.4)$$

These three inequalities imply the assertion. To see that $\pi$ is a bijection, let us
assume that $\pi(l) = \pi(l')$ for two indices $1 \le l < l' \le k$. Indeed, suppose that
$l = \min \pi^{-1}(\pi(l))$. Then $|Q_l| \ge |V_{\pi(l)}| - 0.01\, n_{\min}$ by (3.3), and thus $|V_{\pi(l)} \setminus Q_l| \le
0.1\, n_{\min}$ by (3.4). As $Q_l \cap Q_{l'} = \emptyset$ by construction, we obtain the contradiction

$$0.99\, n_{\min} \overset{(3.3)}{\le} |Q_{l'}| \overset{(3.4)}{\le} 1.1\,|Q_{l'} \cap V_{\pi(l)}| \le 1.1\,|V_{\pi(l)} \setminus Q_l| \le 0.11\, n_{\min}.$$

Finally, as $\pi$ is bijective, (3.3) entails that $|Q_l| \ge 0.9|V_{\pi(l)}|$ for all $1 \le l \le k$.
Hence, due to (3.4) we obtain $|Q_l \cap V_{\pi(l)}| \ge 0.9|Q_l| \ge \frac{1}{2}|V_{\pi(l)}|$, as desired.

The remaining task is to establish (3.2)-(3.4). We proceed by induction on
$l$. Thus, let us assume that (3.2)-(3.4) hold for all $l < L$; we are to show that
then (3.2)-(3.4) are true for $l = L$ as well. As a first step, we establish (3.3). To
this end, consider a class $V_i$ such that $i \notin \pi(\{1, \ldots, L-1\})$ and let $Z_i = \{v \in V_i :
\|\hat A_v - E_v\|^2 \le 0.001\Gamma\}$. Then $0.001\Gamma(|V_i| - |Z_i|) \le \sum_{v \in V_i \setminus Z_i} \|\hat A_v - E_v\|^2 \le
\|\hat A - E\|_F^2 \le Ck\lambda^2$ (cf. Lemma 3.4), whence the assumption (3.1) on $\Gamma$ yields

$$|Z_i| \ge |V_i| - 0.01\, n_{\min}, \qquad (3.5)$$

provided that $C_k$ is sufficiently large. Moreover, for all $v \in Z_i$ we have

$$Q(v) = \{w \in V : \|\hat A_w - \hat A_v\|^2 \le 0.01\Gamma_j\} \supseteq Z_i, \qquad (3.6)$$

because we are assuming that $\Gamma_j \ge \Gamma/2$. In addition, let $w \in Q_l$ for some $l < L$;
since our choice of $i$ ensures that $\pi(l) \ne i$, we have

$$\sqrt{\Gamma} \le \|E^{V_{\pi(l)}} - E_v\| \le \|E_v - \hat A_v\| + \|\hat A_v - \hat A_w\| + \|\xi_l - \hat A_w\| + \|\xi_l - E^{V_{\pi(l)}}\|. \qquad (3.7)$$

Now, the construction in Step 3 of Partition ensures that $\|\hat A_w - \xi_l\| \le 0.1\sqrt{\Gamma}$.
Furthermore, $\|\xi_l - E^{V_{\pi(l)}}\| \le \sqrt{\Gamma}/3$ by induction (cf. (3.2)), and $\|\hat A_v - E_v\| \le
0.1\sqrt{\Gamma}$, because $v \in Z_i$. Hence, (3.7) entails that $\|\hat A_w - \hat A_v\| > 0.1\sqrt{\Gamma}$, so that
$w \notin Q(v)$. Consequently, (3.6) yields

$$Z_i \cap Q_l = \emptyset \quad \text{for all } l < L. \qquad (3.8)$$

Finally, let $v_L$ signify the element chosen by Step 3 of Partition to construct
$Q_L$. Then by construction $|Q_L| = |Q(v_L) \setminus \bigcup_{l=1}^{L-1} Q_l| \ge |Q(v) \setminus \bigcup_{l=1}^{L-1} Q_l|$. Therefore,

$$|Q_L| \ge \Big|Q(v) \setminus \bigcup_{l=1}^{L-1} Q_l\Big| \overset{(3.6),(3.8)}{\ge} |Z_i| \overset{(3.5)}{\ge} |V_i| - 0.01\, n_{\min}.$$

As this estimate holds for all $i \notin \pi(\{1, \ldots, L-1\})$, (3.3) follows.

Thus, we know that $Q_L$ is "big". As a next step, we prove (3.4), i.e., we show
that $Q_L$ "mainly" consists of vertices in $V_{\pi(L)}$. To this end, let $1 \le i \le k$ be
such that $\|E^{V_i} - \hat A_{v_L}\|$ is minimum. Let $Y = Q_L \setminus V_i$. Then for all $w \in Y$ we
have $\|E_w - \hat A_{v_L}\| \ge \|E^{V_i} - \hat A_{v_L}\|$. Further, since $\sqrt{\Gamma} \le \|E_w - E^{V_i}\| \le \|E_w -
\hat A_{v_L}\| + \|E^{V_i} - \hat A_{v_L}\| \le 2\|E_w - \hat A_{v_L}\|$, we conclude that $\|E_w - \hat A_{v_L}\|^2 \ge \frac{1}{4}\Gamma$. On
the other hand, as $w \in Q_L$, we have $\|\hat A_w - \hat A_{v_L}\|^2 \le 0.01\Gamma$. Therefore, we obtain
$\|\hat A_w - E_w\|^2 \ge 0.1\Gamma$ for all $w \in Y$, so that

$$0.1\,|Y|\,\Gamma \le \sum_{w \in Y} \|\hat A_w - E_w\|^2 \le \|\hat A - E\|_F^2 \overset{\text{Lemma 3.4}}{\le} Ck\lambda^2. \qquad (3.9)$$

Hence, due to our assumption (3.1) on $\Gamma$, (3.9) yields that $|Y| \le 0.01\, n_{\min}$.
Consequently, (3.3) entails that $|V_i \cap Q_L| \ge 0.99|Q_L|$, so that $i = \pi(L)$. Hence,
we obtain $|Q_L \cap V_{\pi(L)}| = |Q_L \cap V_i| = |Q_L \setminus Y| \ge |Q_L| - 0.01\, n_{\min}$, thereby
establishing (3.4).

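Finally, to show (3.2), we note that by construction $\|\xi_L - \hat A_{v_L}\|^2 \le 0.01\Gamma$
and $\|\hat A_w - \hat A_{v_L}\|^2 \le 0.01\Gamma$ for all $w \in Q_L \cap V_{\pi(L)}$ (cf. Step 3 of Partition).
Therefore,

$$|Q_L \cap V_{\pi(L)}| \cdot \|E^{V_{\pi(L)}} - \xi_L\|^2 \le 3 \sum_{w \in Q_L \cap V_{\pi(L)}} \big[\|\xi_L - \hat A_{v_L}\|^2 + \|\hat A_{v_L} - \hat A_w\|^2 + \|\hat A_w - E_w\|^2\big]$$
$$\le 0.06\,\Gamma\,|Q_L \cap V_{\pi(L)}| + 3\|\hat A - E\|_F^2 \overset{\text{Lemma 3.4}}{\le} 0.06\,\Gamma\,|Q_L \cap V_{\pi(L)}| + 3Ck\lambda^2. \qquad (3.10)$$

Since $|Q_L \cap V_{\pi(L)}| \ge 0.9\, n_{\min}$ due to (3.3) and (3.4), (3.10) entails that
$\|E^{V_{\pi(L)}} - \xi_L\|^2 \le 0.07\Gamma + \frac{3Ck\lambda^2}{0.9\, n_{\min}} \le 0.1\Gamma$, where the last step uses (3.1) with $C_k$
sufficiently large. Thus, (3.2) follows. □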

In the sequel, we shall assume without loss of generality that the map $\pi$
from Lemma 3.7 is just the identity, i.e., $\pi(i) = i$ for all $i$. Bootstrapping on
the estimate $\|\xi_i - E^{V_i}\|^2 \le 0.1\Gamma$ for $1 \le i \le k$ from Lemma 3.7, we derive the
following stronger estimate.

Corollary 3.8. For all $1 \le i \le k$ we have $\|\xi_i - E^{V_i}\|^2 \le 100\,|Q_i|^{-1} \sum_{v \in Q_i} \|\hat A_v - E_v\|^2$.



Proof. By the Cauchy-Schwarz inequality. 



iic,-E^-ii = iQr' 




E^^ 




El|A„-E^MP 




veQi 






.veQ^ 



1/2 



(3.11) 



Furthermore, as - E^*||2 < O.ir by Lemma 3.7, for all u G \ we have 

II A„ - E^-ip < 2(11 A„ - edP + lie. - E^'i') < (3.12) 

because the construction of Qi in Step 3 of Partition ensures that ||yli, — |p < 
O.OIF. Hence, as ||E„-E^'p > F, (3.12) implies that ||A-Ei,|| > 0.1|| A-E^' ||. 
Therefore, the assertion follows from (3.11). □ 

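Corollary 3.9. For all $v \in S_i \setminus V_i$ we have $\|\hat A_v - \xi_i\| \le 3\|\hat A_v - E_v\|$.

Proof. Let $i \ne l$ and consider a $v \in S_i \cap V_l$. We shall establish below that

$$\|\hat A_v - \xi_i\| \le \|\hat A_v - \xi_l\|. \qquad (3.13)$$

Then by Lemma 3.7 $\|\hat A_v - \xi_l\| \le \|\hat A_v - E_v\| + \|E_v - \xi_l\| \le \|\hat A_v - E_v\| + \sqrt{\Gamma}/3$, and
thus $\sqrt{\Gamma} \le \|E_v - E^{V_i}\| \le \|\hat A_v - \xi_i\| + \|\xi_i - E^{V_i}\| + \|\hat A_v - E_v\| \le 2\|\hat A_v - E_v\| + \frac{2}{3}\sqrt{\Gamma}$.
Consequently, we obtain $\|\hat A_v - E_v\| \ge \frac{1}{6}\sqrt{\Gamma}$, so that the assertion follows from
the estimate

$$\|\hat A_v - \xi_i\| \overset{(3.13)}{\le} \|\hat A_v - \xi_l\| \le \|\hat A_v - E_v\| + \|E_v - \xi_l\| \overset{\text{Lemma 3.7}}{\le} \|\hat A_v - E_v\| + \frac{\sqrt{\Gamma}}{3} \le 3\|\hat A_v - E_v\|.$$

Finally, we prove (3.13). If $v \in S_i \cap V_l \setminus Q_i$, then the construction of $S_i$ in
Step 4 of Partition guarantees that $\|\hat A_v - \xi_i\| \le \|\hat A_v - \xi_l\|$, as claimed. Thus,
assume that $v \in Q_i \cap V_l$. Then

$\|\hat A_v - \xi_i\| \le 0.15\sqrt{\Gamma}$ [by the definition of $Q_i$ in Step 3],

$\max\{\|\xi_i - E^{V_i}\|, \|\xi_l - E_v\|\} \le \frac{1}{3}\sqrt{\Gamma}$ [by Lemma 3.7].

Therefore, if $\|\hat A_v - \xi_l\| < \|\hat A_v - \xi_i\|$, then we would arrive at the contradiction

$$\sqrt{\Gamma} \le \|E^{V_i} - E_v\| \le \|E^{V_i} - \xi_i\| + \|E_v - \xi_l\| + \|\xi_i - \xi_l\|$$
$$\le \tfrac{2}{3}\sqrt{\Gamma} + \|\hat A_v - \xi_i\| + \|\hat A_v - \xi_l\| \le \tfrac{2}{3}\sqrt{\Gamma} + 2\|\hat A_v - \xi_i\| \le 0.99\sqrt{\Gamma}.$$

Thus, we conclude that $\|\hat A_v - \xi_l\| \ge \|\hat A_v - \xi_i\|$, thereby completing the proof. □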
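Proof of Lemma 3.5. Since $|Q_i| \ge \frac{1}{2}|V_i|$ by Lemma 3.7, we have the estimate

$$\sum_{i=1}^{k} \sum_{v \in S_i \cap V_i} \|\hat A_v - \xi_i\|^2 \le 2 \sum_{i=1}^{k} \sum_{v \in S_i \cap V_i} \big[\|\hat A_v - E_v\|^2 + \|E_v - \xi_i\|^2\big]$$
$$\overset{\text{Cor. 3.8}}{\le} 2\|\hat A - E\|_F^2 + 200 \sum_{i=1}^{k} \frac{|V_i|}{|Q_i|} \sum_{v \in Q_i} \|\hat A_v - E_v\|^2 \le 500\,\|\hat A - E\|_F^2. \qquad (3.14)$$

Furthermore, by Corollary 3.9

$$\sum_{i=1}^{k} \sum_{v \in S_i \setminus V_i} \|\hat A_v - \xi_i\|^2 \le 9 \sum_{i=1}^{k} \sum_{v \in S_i \setminus V_i} \|\hat A_v - E_v\|^2 \le 9\|\hat A - E\|_F^2. \qquad (3.15)$$

Since $\|\hat A - E\|_F^2 \le Ck\lambda^2$ by Lemma 3.4, the bounds (3.14) and (3.15) imply the
assertion. □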

3.4. Proof of Lemma 3.6

Set $S_{ab} = S_a \cap V_b$ for $1 \le a, b \le k$. Moreover, for each $1 \le a \le k$ let $1 \le \pi(a) \le k$
be such that $\|E^{V_{\pi(a)}} - \xi_a\|$ is minimum. Then for all $b \ne \pi(a)$ we have

$$\sqrt{\Gamma} \le \|E^{V_{\pi(a)}} - E^{V_b}\| \le \|E^{V_{\pi(a)}} - \xi_a\| + \|E^{V_b} - \xi_a\| \le 2\|E^{V_b} - \xi_a\|, \qquad (3.16)$$

so that $\|E^{V_b} - \xi_a\| \ge \sqrt{\Gamma}/2$. Therefore, by our assumption that $\sum_{i=1}^{k} \sum_{v \in S_i} \|\xi_i -
\hat A_v\|^2 \le C_0 k^3 \lambda^2$ we have
k k 

E l^'^^l ^ |^a,|-||E^'-eaf 

a=l l<b<k:b^TT{a) a.b=l 

k 

< 2 ^ ^ ||E„-Xir + ||A„-eaf 

a.b—l V^Sab 

k 

< 2||l-E||| + 2 ^ W^^-^^f 

a,b—l V^Sab 

Lemma 3.4 o q o , , 

< ACok^X^ + 2Cok^X^ < C^k^X^. (3.17) 

Hence, 

Y,\SaAV^(a)\= E 2|5a,| < ^^2^ < O.OOln^in, (3.18) 

a=l l<a,b<k:b^TT(a) 

provided that Ck is sufficiently large (cf. (3.1)). Combining (3.17) and (3.18), 
we obtain ^\\E''-i^-> - Call" < \Sa n K(a)| • ||Eu(a) - Call" < c^fc^A^, whence 

2^21,3 \2 

||IE^(a) - Call^ < — < O.OOir for aU 1 < a < fc, (3.19) 

^min 

provided that $C_k$ is large enough. Thus, we have established the first two parts of
the lemma. In addition, observe that (3.18) implies that $\pi$ is bijective (because
the sets $S_1, \ldots, S_k$ are pairwise disjoint and $|V_a| \ge n_{\min}$ for all $1 \le a \le k$).
Finally, the third assertion follows from the estimate

fc fc 
^ |5ab|-||E^'<"' -E^'(^)||2 < 2^ |5afc|(||E^'(")-Ca||' + ||E^'<''-ea|P) 

a, 6=1 a,b—l 

(3.16) 

< 8 J2 l^a6|-||E^"<^)-eaf 

a, 6=1 

(3.17) 

< SCok^X"^ < O.OOirnmin, 
where we assume once more that $C_k$ is sufficiently large.



4. Experiments 

Fig 1. Success rate as a function of $N$ for several values of $K$, when $\gamma =
(0.04)^2 = 0.0016$ (balanced case). Each point is an average over 100 trials. Horizontal
lines ("oracles") indicate the information-theoretically best possible success rate for
that value of $K$ (how well one could do if one knew in advance which features
satisfied $p_1^i > p_2^i$ and which satisfied $p_1^i < p_2^i$); they are not exactly horizontal
because they are also an average over 100 runs. Vertical bars indicate the value of
$N$ for which $NK = 1/\gamma^2$.

We illustrate the effectiveness of spectral techniques using simulations. In particular,
we explore the case when we have a mixture of two populations; we
show that when $NK > 1/\gamma^2$ and $K > 1/\gamma$, either the first or the second left
singular vector of $X$ shows an approximately correct partitioning, meaning that
the success rate is well above 1/2. The entry-wise expected value matrix $\mathcal{X}$ is:
among $K/2$ features, $p_1^i > p_2^i$ and for the other half, $p_1^i < p_2^i$, such that $\forall i$,
$p_1^i, p_2^i \in \{\frac{1+\alpha}{2} + \frac{\epsilon}{2}, \frac{1-\alpha}{2} + \frac{\epsilon}{2}\}$, where $\epsilon = 0.1\alpha$. Hence $\gamma = \alpha^2$. We report results
on balanced cases only, but we do observe that unbalanced cases show similar
tradeoffs. For each population $P$, the success rate is defined as the number of
individuals that are correctly classified, i.e., they belong to a group in which $P$ is
the majority, divided by the size of the population $|P|$.

Each point on the SVD curve corresponds to an average rate over 100 trials.
Since we are interested in exploring the tradeoffs of $N$, $K$ in all ranges (e.g., when
$N \ll K$ or $N \gg K$), rather than using the threshold $T$ in Procedure Classify,
which is chosen for the case that both $N, K > 1/\gamma$, to decide which singular vector to use,
we try both $u_1$ and $u_2$ and use the more effective one to measure the success
rate at each trial. For each data point, the distribution of $X$ is fixed across all
trials and we generate an independent $X_{2N \times K}$ for each trial to measure success
rate based on the more effective classifier between $u_1$ and $u_2$.

One can see from the plot that when $K < 1/\gamma$, i.e., when $K = 200$ and $400$,
no matter how much we increase $N$, the success rate is consistently low. Note
that a success rate of 50/100 is equivalent to a total failure. In contrast, when $N$ is
smaller than $1/\gamma$, as we increase $K$, we can always classify with a high success
rate, where in general, $NK > 1/\gamma^2$ is indeed necessary to see a high success
rate. In particular, the curves for $K = 5000, 2500, 1250$ show the sharpness of
the threshold behavior as the sample size $n$ increases from below to above the threshold.

For each curve, we also compute the best possible classification one could hope to
make if one knew in advance which features satisfied $p_1^i > p_2^i$ and which satisfied
$p_1^i < p_2^i$. These are the horizontal(ish) dotted lines above each curve. The fact
that the solid curves are approaching these information-theoretic upper bounds
shows that the spectral technique is correctly using the available information.
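A single trial of this kind of experiment can be sketched as follows. This is an illustrative simplification rather than the setup behind Fig 1: it assumes a balanced mixture, ignores the offset $\epsilon$, uses a much larger $\alpha$ than the experiments so that one small trial is informative, splits each singular vector at its median (an assumption that only makes sense in the balanced case), and reports overall accuracy up to swapping the two labels. The function name `simulate_success_rate` and its parameters are hypothetical.

```python
import numpy as np


def simulate_success_rate(N: int = 100, K: int = 2000,
                          alpha: float = 0.2, seed: int = 0) -> float:
    """One trial: 2N individuals, K features, two balanced populations."""
    rng = np.random.default_rng(seed)
    # Half of the features favour population 1, the other half population 2.
    p1 = np.where(np.arange(K) < K // 2, 0.5 + alpha / 2, 0.5 - alpha / 2)
    p2 = 1.0 - p1
    P = np.vstack([np.tile(p1, (N, 1)), np.tile(p2, (N, 1))])
    X = (rng.random(P.shape) < P).astype(float)  # 2N x K Bernoulli sample
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    truth = np.repeat([0, 1], N)
    best = 0.0
    for u in (U[:, 0], U[:, 1]):                 # try both u_1 and u_2
        guess = (u > np.median(u)).astype(int)
        acc = float(np.mean(guess == truth))
        best = max(best, acc, 1.0 - acc)         # labels are only defined up to a swap
    return best


if __name__ == "__main__":
    print(simulate_success_rate())
```

Shrinking `K` or `N` so that $NK$ falls well below $1/\gamma^2$ (with $\gamma = \alpha^2$ here) should push the returned rate towards 1/2, in line with the threshold behavior described above.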
Appendix A: More Proofs for the simple algorithm Classify

A.1. Proof of Lemma 2.1

Let $u_1, \ldots, u_n$, $v_1, \ldots, v_n$ be the $n$ left and right singular vectors of $X$, corresponding
to $s_1(X) \ge s_2(X) \ge \cdots \ge s_n(X)$; we have for all $i$, $\|u_i\|_2 = 1$, $\|v_i\|_2 = 1$,
such that $X^T u_i = s_i(X) v_i$ and $X v_i = s_i(X) u_i$.

Before we prove Lemma 2.1, given an $n \times K$ matrix $X$, where $n < K$, let us
first define $H = XX^T$ and a block matrix

$$Y = \begin{bmatrix} 0 & X \\ X^T & 0 \end{bmatrix}_{(2N+K) \times (2N+K)}. \qquad (A.1)$$

Recall that singular values of a real $n \times K$ matrix $X$ are exactly the non-negative
square roots of the $n$ largest eigenvalues of $H = XX^T$, i.e., $s_i(X) =
\sqrt{\lambda_i(H)}$, $\forall i = 1, \ldots, n$, given that

$$H u_i = XX^T u_i = s_i(X) X v_i = s_i^2(X) u_i. \qquad (A.2)$$

Hence the left singular vectors $u_1, \ldots, u_n$ of $X$ are eigenvectors of $H$ corresponding
to $\lambda_i(H) = s_i^2(X)$.

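We next derive the first $n$ eigenvalues of $Y$ and their corresponding eigenvectors:

$$Y \begin{bmatrix} u_i \\ v_i \end{bmatrix} = \begin{bmatrix} 0 & X \\ X^T & 0 \end{bmatrix} \begin{bmatrix} u_i \\ v_i \end{bmatrix} = \begin{bmatrix} X v_i \\ X^T u_i \end{bmatrix} = \begin{bmatrix} s_i(X) u_i \\ s_i(X) v_i \end{bmatrix} = s_i(X) \begin{bmatrix} u_i \\ v_i \end{bmatrix}, \qquad (A.3)$$

and hence

Proposition A.1. The largest $n$ eigenvalues of $Y$ are $s_1(X), \ldots, s_n(X)$ with
corresponding eigenvectors $[u_i, v_i]$, $\forall i = 1, \ldots, n$, where $u_i, v_i$ are left and
right singular vectors of $X$ corresponding to $s_i(X)$.

In fact both $\pm s_i(X)$ are eigenvalues of $Y$, but this is irrelevant here.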
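The relation between the block matrix and the singular values of $X$ can be checked numerically. The sketch below (the helper name `block_matrix` is hypothetical) builds the symmetric block matrix from an arbitrary real matrix and compares its spectrum with the singular values.

```python
import numpy as np


def block_matrix(X: np.ndarray) -> np.ndarray:
    """Return the symmetric matrix [[0, X], [X^T, 0]]."""
    n, K = X.shape
    return np.block([[np.zeros((n, n)), X], [X.T, np.zeros((K, K))]])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((4, 7))
    eig = np.sort(np.linalg.eigvalsh(block_matrix(X)))[::-1]
    s = np.linalg.svd(X, compute_uv=False)
    print(eig[:4])  # the four largest eigenvalues
    print(s)        # the singular values of X
```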

Proof of Lemma 2.1. We first state a theorem, whose statement appears in
a lecture note by Spielman (2002), with a slight modification (off by a factor
on the RHS). Our proof of this theorem is included here for completeness. It is
known that for any real symmetric matrix, there exists a set of $n$ orthonormal
eigenvectors.

Theorem A.2 (Modified Version of Spielman (2002)). Let $A$ and $M$ be two
symmetric matrices and $E = M - A$. Let $\lambda_1(A) \ge \lambda_2(A) \ge \cdots \ge \lambda_n(A)$ be
eigenvalues of $A$, with orthonormal eigenvectors $v_1, v_2, \ldots, v_n$, and let $\lambda_1(M) \ge
\lambda_2(M) \ge \cdots \ge \lambda_n(M)$ be eigenvalues of $M$ and $w_1, w_2, \ldots, w_n$ be the corresponding
orthonormal eigenvectors of $M$, with $\theta_i = \angle(v_i, w_i)$. Then

$$\sin(\theta_i) \le \frac{\|E - \Delta_i I\|_2}{\mathrm{gap}(i, A)} \le \frac{\|E\|_2 + |\Delta_i|}{\mathrm{gap}(i, A)} \le \frac{2\|E\|_2}{\mathrm{gap}(i, A)}, \qquad (A.4)$$

where $\mathrm{gap}(i, A) = \min_{j \ne i} |\lambda_i(A) - \lambda_j(A)|$ and $\Delta_i = \lambda_i(M) - \lambda_i(A)$.
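The bound of Theorem A.2 can be sanity-checked on random symmetric matrices. In the sketch below (the function name `sin_theta_and_bound` is hypothetical), the eigenvectors of the two matrices are paired by rank order; `numpy.linalg.eigh` returns eigenvalues in increasing order for both matrices, so the pairing agrees with the decreasing-order pairing of the theorem.

```python
import numpy as np


def sin_theta_and_bound(A: np.ndarray, M: np.ndarray):
    """For symmetric A, M return (sin(theta_i), 2 * ||M - A|| / gap(i, A))."""
    lam_A, V = np.linalg.eigh(A)
    _, W = np.linalg.eigh(M)
    E_norm = np.linalg.norm(M - A, 2)
    # |cos(theta_i)| is the absolute inner product of paired eigenvectors.
    cos = np.abs(np.sum(V * W, axis=0))
    sin_theta = np.sqrt(np.clip(1.0 - cos ** 2, 0.0, None))
    diff = np.abs(lam_A[:, None] - lam_A[None, :])
    np.fill_diagonal(diff, np.inf)
    gap = diff.min(axis=1)  # gap(i, A) = min_{j != i} |lambda_i - lambda_j|
    return sin_theta, 2.0 * E_norm / gap


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B = rng.standard_normal((6, 6))
    A = (B + B.T) / 2
    P = rng.standard_normal((6, 6))
    M = A + 0.05 * (P + P.T) / 2
    print(sin_theta_and_bound(A, M))
```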



Let us apply Theorem A.2 to the symmetric matrix $Y$ in (A.1) and its static
counterpart $\mathcal{Y}$. In particular, we only compare the first $n$ eigenvectors of $Y$ and
of $\mathcal{Y}$. For the numerator of the RHS of (A.4), we have $E = Y - \mathcal{Y}$, and
$\|E\|_2 = \|Y - \mathcal{Y}\|_2 = s_1(X - \mathcal{X})$ by a derivation
similar to (A.3), where eigenvectors of $E$ are concatenations of left and right
singular vectors of $X - \mathcal{X}$. For the denominator, we have by Proposition A.1,
$\mathrm{gap}(i, \mathcal{Y}) = \min_{j \ne i} |\lambda_i(\mathcal{Y}) - \lambda_j(\mathcal{Y})| = \min_{j \ne i} |s_i(\mathcal{X}) - s_j(\mathcal{X})|$. □

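We first prove the following claim.

Claim A.3. For any symmetric $n \times n$ matrix $A$, let $\lambda_i$, $\forall i = 1, \ldots, n$, be eigenvalues
of $A$ with orthonormal eigenvectors $v_1, v_2, \ldots, v_n$. Then for all $y \perp v_i$,

$$\|(A - \lambda_i I) y\|_2 \ge \min_{j \ne i} |\lambda_i - \lambda_j| \, \|y\|_2.$$

Proof. Let us assume $y \perp v_i$ and write $y = \sum_{j \ne i} c_j v_j$; thus we have
$\|y\|_2 = \sqrt{\sum_{j \ne i} c_j^2}$ and

$$\|(A - \lambda_i I) y\|_2 = \Big\|\sum_{j \ne i} c_j (A - \lambda_i I) v_j\Big\|_2 = \Big\|\sum_{j \ne i} c_j (\lambda_j - \lambda_i) v_j\Big\|_2 = \sqrt{\sum_{j \ne i} c_j^2 (\lambda_j - \lambda_i)^2} \ge \min_{j \ne i} |\lambda_i - \lambda_j| \, \|y\|_2.$$

□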
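Proof of Theorem A.2. Let us construct a vector $y$ that is orthogonal to $v_i$ as
follows:

$$y = w_i - (v_i^T w_i) v_i.$$

By Claim A.3, we have

$$\|(A - \lambda_i(A) I) y\|_2 \ge \min_{j \ne i} |\lambda_i(A) - \lambda_j(A)| \, \|y\|_2,$$

and hence

$$\|y\|_2 \le \frac{\|(A - \lambda_i(A) I) y\|_2}{\min_{j \ne i} |\lambda_i(A) - \lambda_j(A)|}.$$

On the other hand,

$$\|(A - \lambda_i(A) I) y\|_2 = \|(A - \lambda_i(A) I)(w_i - (v_i^T w_i) v_i)\|_2 = \|(A - \lambda_i(A) I) w_i\|_2$$
$$= \|(M - E - \lambda_i(A) I) w_i\|_2 = \|(\lambda_i(M) - \lambda_i(A)) w_i - E w_i\|_2$$
$$= \|(\Delta_i I - E) w_i\|_2 \le \|E - \Delta_i I\|_2 \le \|E\|_2 + |\Delta_i|.$$

Finally, given that $\|w_i\|_2 = 1$,

$$\sin(\theta_i) = \frac{\|y\|_2}{\|w_i\|_2} \le \frac{\|(A - \lambda_i(A) I) y\|_2}{\|w_i\|_2 \min_{j \ne i} |\lambda_i(A) - \lambda_j(A)|} \le \frac{\|E\|_2 + |\Delta_i|}{\mathrm{gap}(i, A)}.$$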

Lemma A.4. $\forall i = 1, \ldots, n$, $|\Delta_i| \le \|E\|_2$.

Proof. Let $S_j$ be a subspace of dimension $j$. Recall the following definition of
$\lambda_i$ for a symmetric matrix:

$$\lambda_i(M) = \inf_{S_{n-i+1}} \ \sup_{x \in S_{n-i+1},\, \|x\|_2 = 1} x^T M x. \qquad (A.5)$$

In the following, let $S^A_{n-i+1}$ be the subspace that is orthogonal to the subset of
orthonormal eigenvectors $v_1, \ldots, v_{i-1}$ of the symmetric matrix $A$. Note that this is
the $(n - i + 1)$-dimensional subspace that achieves the minimum of the maximum
of $v^T A v$ over all unit-length vectors $v$ in the particular subspace. We have

$$\lambda_i(M) = \inf_{S_{n-i+1}} \ \sup_{x \in S_{n-i+1},\, \|x\|_2 = 1} x^T M x \le \sup_{x \in S^A_{n-i+1},\, \|x\|_2 = 1} x^T M x$$
$$\le \sup_{x \in S^A_{n-i+1},\, \|x\|_2 = 1} x^T (A + E) x \le \sup_{v \in S^A_{n-i+1},\, \|v\|_2 = 1} v^T A v + \sup_{\|x\|_2 = 1} x^T E x \le \lambda_i(A) + \|E\|_2.$$

For the other direction, let $S^M_{n-i+1}$ be the subspace that is orthogonal to the
subset of orthonormal eigenvectors $w_1, \ldots, w_{i-1}$ of the symmetric matrix $M$. Note
that this is the $(n - i + 1)$-dimensional subspace that achieves the minimum of the
maximum of $w^T M w$ over all unit-length vectors $w$ in the particular subspace.
We have

$$\lambda_i(A) = \inf_{S_{n-i+1}} \ \sup_{x \in S_{n-i+1},\, \|x\|_2 = 1} x^T A x \le \sup_{x \in S^M_{n-i+1},\, \|x\|_2 = 1} x^T A x$$
$$\le \sup_{x \in S^M_{n-i+1},\, \|x\|_2 = 1} x^T (M + (-E)) x \le \sup_{w \in S^M_{n-i+1},\, \|w\|_2 = 1} w^T M w + \sup_{\|x\|_2 = 1} |x^T (-E) x|$$
$$= \lambda_i(M) + \|E\|_2,$$

where $\|E\|_2 = \|-E\|_2$. Thus $-\|E\|_2 \le \lambda_i(M) - \lambda_i(A) \le \|E\|_2$, and $|\Delta_i| \le
\|E\|_2$. □

Therefore, $\sin(\theta_i) \le \frac{\|E\|_2 + |\Delta_i|}{\mathrm{gap}(i, A)} \le \frac{2\|E\|_2}{\mathrm{gap}(i, A)}$. □
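Lemma A.4 says that each eigenvalue moves by at most the operator norm of the perturbation. A quick numerical check, with a hypothetical helper `eigenvalue_shifts` that pairs eigenvalues in decreasing order, might look like this:

```python
import numpy as np


def eigenvalue_shifts(A: np.ndarray, E: np.ndarray) -> np.ndarray:
    """Return lambda_i(A + E) - lambda_i(A), eigenvalues sorted decreasingly."""
    lam_A = np.linalg.eigvalsh(A)[::-1]
    lam_M = np.linalg.eigvalsh(A + E)[::-1]
    return lam_M - lam_A


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B = rng.standard_normal((5, 5))
    P = rng.standard_normal((5, 5))
    A, E = (B + B.T) / 2, (P + P.T) / 2
    print(np.max(np.abs(eigenvalue_shifts(A, E))), np.linalg.norm(E, 2))
```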

A.2. Some Propositions regarding the static matrices



X 
XT 



For static matrix Ti, ~ XX and y 

gap(7^) = \\i{n) ^ \2{n)\ 



, we define 



My) - |Ai(3^)-A2(3^)| - 



\i{y) + \2{yy 



Proposition A. 5. For static matrix y , let gap{y) = |Ai(3^)— A2(3^)| = \i^)+\l{y) ' 
we have 



yJmaK{Nia, iVac} < \i{y) < y/N^a + N2C, 



yjNia + N2C < Xi{y) + X2{y) < x/2{Nia + iVzc), 



Nia + N2C - 2m - y iVia + iVac ^ 



\y/Nia + N2cJ \ y/Nia + N2C 

Proof. We first show the following: 
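Proposition A.6. For the static matrix $\mathcal{H} = \mathcal{X}\mathcal{X}^T$ as in (2.12), let $\lambda_1(\mathcal{H}), \lambda_2(\mathcal{H})$
be the non-zero eigenvalues of $\mathcal{H}$, and denote $\mathrm{gap}(\mathcal{H}) = |\lambda_1(\mathcal{H}) - \lambda_2(\mathcal{H})|$. Then

$$\lambda_1(\mathcal{H}) = \frac{N_1 a + N_2 c + \sqrt{(N_1 a - N_2 c)^2 + 4 N_1 N_2 b^2}}{2}, \qquad (A.6)$$
$$\lambda_2(\mathcal{H}) = \frac{N_1 a + N_2 c - \sqrt{(N_1 a - N_2 c)^2 + 4 N_1 N_2 b^2}}{2}, \qquad (A.7)$$
$$|N_1 a - N_2 c| \le \mathrm{gap}(\mathcal{H}) \le N_1 a + N_2 c, \qquad (A.8)$$

where $\lambda_2(\mathcal{H}) = 0$ when $ac = b^2$, in which case $\mathrm{gap}(\mathcal{H}) = N_1 a + N_2 c$.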

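Proof. Let $\mathcal{H} = \mathcal{X}\mathcal{X}^T$. The rank of $\mathcal{H}$ is at most 2. Therefore there exist at most
two non-zero eigenvalues $\lambda_1, \lambda_2$ for $\mathcal{H}$, with corresponding nonzero eigenvectors
$v_1, v_2$ being constant on each population. This is true because if we multiply
$\mathcal{H} = \mathcal{X}\mathcal{X}^T$ by a permutation matrix $P$ that exchanges two rows within the same
population, we have $P\mathcal{H}v_i = \lambda_i P v_i$, $\forall i = 1, 2$; given that $P\mathcal{H}v_i = \mathcal{H}v_i$, we
deduce that $P v_i = v_i$ for non-zero $\lambda_i$. Hence $v_i$ must be constant on each
population.

Let the top two eigenvectors $v_1, v_2$ be of the form $[x, \ldots, x, y, \ldots, y]$, where $x$
repeats $N_1$ times and $y$ repeats $N_2$ times; note that they correspond to $u_1$ and
$u_2$ of $\mathcal{X}$ following a derivation similar to (A.2).

We thus have the following equations:

$$N_1 a x + N_2 b y = \lambda x, \qquad (A.9)$$
$$N_1 b x + N_2 c y = \lambda y, \qquad (A.10)$$

which can be written in matrix form:

$$\begin{bmatrix} N_1 a - \lambda & N_2 b \\ N_1 b & N_2 c - \lambda \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = 0.$$

Given that $[x, y]^T \ne 0$, the matrix is not one-to-one and therefore

$$\det \begin{bmatrix} N_1 a - \lambda & N_2 b \\ N_1 b & N_2 c - \lambda \end{bmatrix} = 0.$$

By solving $(N_1 a - \lambda)(N_2 c - \lambda) - N_1 N_2 b^2 = 0$, we get $\lambda_1(\mathcal{H})$, $\lambda_2(\mathcal{H})$ and $\mathrm{gap}(\mathcal{H})$.
We next derive an upper bound on $\mathrm{gap}(\mathcal{H})$:

$$\mathrm{gap}(\mathcal{H}) = \sqrt{(N_1 a - N_2 c)^2 + 4 N_1 N_2 b^2} \qquad (A.11)$$
$$= \sqrt{(N_1 a + N_2 c)^2 - 4 N_1 N_2 ac + 4 N_1 N_2 b^2} \qquad (A.12)$$
$$\le \sqrt{(N_1 a + N_2 c)^2} \qquad (A.13)$$
$$= N_1 a + N_2 c, \qquad (A.14)$$

where $a, c > 0$ and $ac \ge b^2$ as in Proposition 2.10. Moreover, $\mathrm{gap}(\mathcal{H}) \ge |N_1 a - N_2 c|$,
given that $b^2 \ge 0$. □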
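The closed forms (A.6) and (A.7) can be compared against a direct eigenvalue computation. The sketch below assumes that $a$, $b$, $c$ are the squared norm of the first population's row, the inner product of the two rows, and the squared norm of the second population's row, which is what equations (A.9) and (A.10) require; the function names `static_eigenvalues` and `static_matrix` are hypothetical.

```python
import numpy as np


def static_eigenvalues(N1: int, N2: int, p1: np.ndarray, p2: np.ndarray):
    """Closed-form non-zero eigenvalues of H = X X^T for a two-row-type X."""
    a, b, c = p1 @ p1, p1 @ p2, p2 @ p2   # assumed meaning of a, b, c
    root = np.sqrt((N1 * a - N2 * c) ** 2 + 4 * N1 * N2 * b ** 2)
    return (N1 * a + N2 * c + root) / 2, (N1 * a + N2 * c - root) / 2


def static_matrix(N1: int, N2: int, p1: np.ndarray, p2: np.ndarray) -> np.ndarray:
    """Stack N1 copies of row p1 on top of N2 copies of row p2."""
    return np.vstack([np.tile(p1, (N1, 1)), np.tile(p2, (N2, 1))])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p1, p2 = rng.uniform(0.5, 1.0, 30), rng.uniform(0.5, 1.0, 30)
    X = static_matrix(4, 7, p1, p2)
    print(static_eigenvalues(4, 7, p1, p2))
    print(np.linalg.eigvalsh(X @ X.T)[-2:][::-1])
```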
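Thus we have

$$\max\{N_1 a, N_2 c\} \le \lambda_1(\mathcal{H}) \le N_1 a + N_2 c, \qquad (A.15)$$
$$0 \le \lambda_2(\mathcal{H}) \le \min\{N_1 a, N_2 c\}, \qquad (A.16)$$
$$\lambda_1(\mathcal{H}) + \lambda_2(\mathcal{H}) = N_1 a + N_2 c, \qquad (A.17)$$
$$\lambda_1(\mathcal{H}) \lambda_2(\mathcal{H}) = N_1 N_2 (ac - b^2). \qquad (A.18)$$

Given the two largest eigenvalues of $\mathcal{Y}$, $\lambda_1(\mathcal{Y}) = \sqrt{\lambda_1(\mathcal{H})}$ and $\lambda_2(\mathcal{Y}) = \sqrt{\lambda_2(\mathcal{H})}$, for

$$\mathcal{Y} = \begin{bmatrix} 0 & \mathcal{X} \\ \mathcal{X}^T & 0 \end{bmatrix},$$

we get all inequalities, by Proposition A.6 and the following:

$$\sqrt{\lambda_1(\mathcal{H}) + \lambda_2(\mathcal{H})} \le \lambda_1(\mathcal{Y}) + \lambda_2(\mathcal{Y}) \le \sqrt{2(\lambda_1(\mathcal{H}) + \lambda_2(\mathcal{H}))}.$$

□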



A.3. Proofs of Propositions 2.5 and 2.6

Proof of Proposition 2.5. We rewrite Proposition A. 5 given that, for a normal- 



ized X, gap(W) > ^21^ 

gap{X) = gap(y) = 



g , as Proposition A. 8 and Xj{y) 

gap(W) 



j{X). In particular, 



Ai(3^) + A2(3^) 



> 



> 



gMn) 



> 



ScqNK 



^Nia + N2C 5V2NK 



For the upper bound on gap(A'), we have that 
gap(A') = gap(y) = 



gap(W) 



< 



Ai(3^) + A2(3^) 
^^« + ^^^ <V2NK 



y^Nia + N2C 

Definition A. 7. For our application, we have Vfc, 1 > p\,P2 > 0, and 



□ 





2 


2 


i+pf 1 
2 






l+p} 
2 

I+P2 
2 


2 
2 


i+pf 
2 

2 






L 2 


2 


2 J 


2NxK 



It is easy to see that with this normalized random matrix, $\lambda_1(\mathcal{H}) = \lambda_2(\mathcal{H})$ is
not possible, given that $a, b, c \in [K/4, K]$; furthermore, $\mathrm{gap}(\mathcal{H}) = \Theta(NK)$ as in
Proposition A.8.

Proposition A.8. Given $\mathcal{H} = \mathcal{X}\mathcal{X}^T$ and $a, b, c$ as in (2.8) for any expected
value mean matrix $\mathcal{X}$, which is not necessarily normalized,

$$\mathrm{gap}(\mathcal{H}) = \sqrt{(N_1 a - N_2 c)^2 + 4 N_1 N_2 b^2} \ge \frac{8 c_0 N K}{5}.$$

Hence for a normalized $\mathcal{X}$, $\mathrm{gap}(\mathcal{H}) = \Theta(NK)$ given that $a, b, c \in [K/4, K]$.

Proof. For a tighter lower bound of gap(7i) than the obvious \Nia — A^2c|, let 
us assume w.l.o.g. that N2C > Nia. Thus we have 



N2 > 2N-^ (A.19) 



We differentiate two cases: 



• Balanced case: Nia> ■^N2C. 

• Imbalanced case: Nia < ^^2C. 

For balanced case: we have Ni > 4s and hence 

gap(W) > ^AN,N2h^ > IM^ /£ 

5 y a 

^ 8N\b\ a ^ m\b\ 

~ 5 a + c\ a ~ 5 a + c 
ScqNK 
5 ' 

where N2 > 2N-^ as in (A.19). 

For the imbalanced case, given that ^/ac > \b\ by Proposition 2.10, 

gap(H) > V(^i«-^2c)2 > 

25 

^ 42 Nac ^ 8 N\b\y^ ^ ScqNK 
~ 25a + c~5 a + c ~ 5 

Finally, for a normalized random matrix $X$ and its $\mathcal{X}$, we have $c_0$ being a
constant, and combining this with the upper bound $\mathrm{gap}(\mathcal{H}) \le N_1 a + N_2 c \le 2NK$
we conclude that $\mathrm{gap}(\mathcal{H}) = \Theta(NK)$. □

Proof of Proposition 2.6. By (A.2), $u_1, u_2$ are the first and second eigenvectors
of $\mathcal{H}$ corresponding to $\lambda_1(\mathcal{H})$ and $\lambda_2(\mathcal{H})$. Let $x, y$ be entries that correspond to
$P_1, P_2$ respectively in the first or second eigenvector of $\mathcal{H}$. By (A.9) and (A.10),
we have

$$\frac{y}{x} = \frac{\lambda - N_1 a}{N_2 b} = \frac{N_1 b}{\lambda - N_2 c}.$$

In addition, given any $b \ne 0$, we have $\mathrm{gap}(\mathcal{H}) > |N_1 a - N_2 c|$ and hence $\lambda_1(\mathcal{H}) >
\max\{N_1 a, N_2 c\} > \lambda_2(\mathcal{H})$. Therefore, for $b > 0$, $\frac{y}{x} > 0$ for the first eigenvector and
$\frac{y}{x} < 0$ for the second; for $b < 0$, it is the opposite. □
A.4. Proof of Lemma 2.7

Proof of Lemma 2.7. We first show that $|x_2|, |y_2|$ are within a constant factor
of each other, given that $\omega_1/\omega_2 = N_1/N_2$ is a constant.

Proposition A.9. For a normalized $\mathcal{X}$, where $N_1, N_2, a, b \ne 0$, the entries $x_2, y_2$ in the
second left singular vector $\bar{u}_2$ satisfy

$$\frac{2N_2}{N_1} \ge \frac{|x_2|}{|y_2|} \ge \frac{N_2}{2N_1}.$$

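Proof. By (A.9) and given the upper bound on $\mathrm{gap}(\mathcal{H})$ in (A.8),

\[
\frac{|y_2|}{|x_2|} = \frac{N_1 a - \lambda_2}{N_2 b} = \frac{N_1 a - N_2 c + \mathrm{gap}(\mathcal{H})}{2 N_2 b} \le \frac{N_1 a}{N_2 b}, \tag{A.20}
\]

and hence $\frac{|x_2|}{|y_2|} \ge \frac{N_2 b}{N_1 a}$. By (A.10) and (A.8), we have

\[
\frac{|x_2|}{|y_2|} = \frac{N_2 c - \lambda_2}{N_1 b} = \frac{N_2 c - N_1 a + \mathrm{gap}(\mathcal{H})}{2 N_1 b} \le \frac{N_2 c}{N_1 b}. \tag{A.21}
\]

We finish the proof by observing that $\frac{1}{2} \le \frac{b}{a} \le 2$ and $\frac{1}{2} \le \frac{c}{b} \le 2$, due to the
fact that $\frac{1}{2} \le \frac{\mu_1^j}{\mu_2^j} \le 2$, $\forall j = 1, \dots, K$, for $\mu_i^j \in [1/2, 1]$ in a normalized $X$, and the
following lemma: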

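Lemma A.10. If $0 < C_{\min} \le \frac{a_i}{b_i} \le C_{\max}$, $\forall i = 1, \dots, n$, where $a_i, b_i > 0$, then

\[
C_{\min} \le \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i} \le C_{\max}.
\]

Proof. $C_{\min} = \frac{\sum_{i} C_{\min} b_i}{\sum_{i} b_i} \le \frac{\sum_{i} a_i}{\sum_{i} b_i} \le \frac{\sum_{i} C_{\max} b_i}{\sum_{i} b_i} = C_{\max}$. □

□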
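As a quick illustration of Lemma A.10, the sketch below checks that a ratio of sums lies between the smallest and largest termwise ratios. Applying it to $b/a$ assumes that $a = \sum_j (\mu_1^j)^2$ and $b = \sum_j \mu_1^j \mu_2^j$, so that the termwise ratios are $\mu_2^j/\mu_1^j$; the means and the function name are hypothetical.

```python
def ratio_bounds(num, den):
    """Return (min termwise ratio, ratio of sums, max termwise ratio)."""
    ratios = [n / d for n, d in zip(num, den)]
    return min(ratios), sum(num) / sum(den), max(ratios)

# Hypothetical feature means in [1/2, 1].
mu1 = [0.5, 0.75, 1.0, 0.6]
mu2 = [1.0, 0.5, 0.8, 0.9]

# Terms of b (numerator) and a (denominator) under the assumed definitions.
b_terms = [p * q for p, q in zip(mu1, mu2)]
a_terms = [p * p for p in mu1]
lo, b_over_a, hi = ratio_bounds(b_terms, a_terms)
print(lo <= b_over_a <= hi, 0.5 <= b_over_a <= 2.0)
```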

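Let $x = x_2$ and $y = y_2$. By Proposition A.9, $|y| \le \frac{2|x|N_1}{N_2}$ and

\[
1 = N_1 x^2 + N_2 y^2 \le N_1 x^2 + N_2 \left(\frac{2|x|N_1}{N_2}\right)^2 = x^2 \left(\frac{4N_1^2 + N_1 N_2}{N_2}\right),
\]

hence, writing $N_i = 2N\omega_i$,

\[
|x|^2 \ge \frac{N_2}{4N_1^2 + N_1 N_2} = \frac{\omega_2}{4\omega_1^2 + \omega_1\omega_2} \cdot \frac{1}{2N}. \tag{A.22}
\]

Looking in the other direction, by Proposition A.9, $|x| \le \frac{2|y|N_2}{N_1}$, so that

\[
1 = N_1 x^2 + N_2 y^2 \le N_1 \left(\frac{2|y|N_2}{N_1}\right)^2 + N_2 y^2 = y^2 \left(\frac{4N_2^2 + N_1 N_2}{N_1}\right),
\]

and hence

\[
|y|^2 \ge \frac{\omega_1}{4\omega_2^2 + \omega_1\omega_2} \cdot \frac{1}{2N}.
\]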



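On the other hand, by Proposition A.9, we have $|y| \ge \frac{|x|N_1}{2N_2}$, so

\[
1 = N_1 x^2 + N_2 y^2 \ge N_1 x^2 + N_2 \left(\frac{|x|N_1}{2N_2}\right)^2 = x^2 \left(\frac{N_1^2 + 4N_1 N_2}{4N_2}\right),
\]

and thus $|x|^2 \le \frac{4N_2}{N_1^2 + 4N_1 N_2} = \frac{4\omega_2}{\omega_1^2 + 4\omega_1\omega_2} \cdot \frac{1}{2N}$. Looking in the other direction, by Proposition A.9,
$|x| \ge \frac{|y|N_2}{2N_1}$, so

\[
1 = N_1 x^2 + N_2 y^2 \ge N_2 y^2 + N_1 \left(\frac{|y|N_2}{2N_1}\right)^2 = y^2 \left(\frac{4N_2 N_1 + N_2^2}{4N_1}\right),
\]

and hence $|y|^2 \le \frac{4N_1}{N_2^2 + 4N_1 N_2} = \frac{4\omega_1}{\omega_2^2 + 4\omega_1\omega_2} \cdot \frac{1}{2N}$. Hence we have that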
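\[
|x - y|^2 = (|x| + |y|)^2 \le \left(\sqrt{\frac{4\omega_2}{\omega_1^2 + 4\omega_1\omega_2} \cdot \frac{1}{2N}} + \sqrt{\frac{4\omega_1}{\omega_2^2 + 4\omega_1\omega_2} \cdot \frac{1}{2N}}\right)^2
= \frac{1}{2N} \left(\sqrt{\frac{4\omega_2}{\omega_1^2 + 4\omega_1\omega_2}} + \sqrt{\frac{4\omega_1}{\omega_2^2 + 4\omega_1\omega_2}}\right)^2,
\]

and $C_{\max} = \left(\sqrt{\frac{4\omega_2}{\omega_1^2 + 4\omega_1\omega_2}} + \sqrt{\frac{4\omega_1}{\omega_2^2 + 4\omega_1\omega_2}}\right)^2$. Hence

\[
\sqrt{C_{\max}} = \sqrt{\frac{4\omega_2}{\omega_1^2 + 4\omega_1\omega_2}} + \sqrt{\frac{4\omega_1}{\omega_2^2 + 4\omega_1\omega_2}} \le \sqrt{\frac{1}{\omega_1}} + \sqrt{\frac{1}{\omega_2}}.
\]

□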
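The two-sided bounds in this proof follow from the ratio constraint of Proposition A.9 together with the normalization $N_1 x^2 + N_2 y^2 = 1$. The sketch below checks them numerically under the assumptions that $N_1 + N_2 = 2N$ and $\omega_i = N_i/(2N)$; the function names, the tolerance, and the sample values are hypothetical.

```python
import math

def entries_from_ratio(N1, N2, r):
    """Given r = |x|/|y| and N1*x^2 + N2*y^2 = 1, return (x^2, y^2)."""
    y2 = 1.0 / (N1 * r * r + N2)
    return r * r * y2, y2

def within_bounds(N1, N2, r, tol=1e-12):
    """Check the bounds on x^2, y^2 and (|x| + |y|)^2 derived in the proof."""
    x2, y2 = entries_from_ratio(N1, N2, r)
    x_lo = N2 / (4 * N1 ** 2 + N1 * N2)
    x_hi = 4 * N2 / (N1 ** 2 + 4 * N1 * N2)
    y_lo = N1 / (4 * N2 ** 2 + N1 * N2)
    y_hi = 4 * N1 / (N2 ** 2 + 4 * N1 * N2)
    two_n = N1 + N2                      # assumed: N1 + N2 = 2N
    w1, w2 = N1 / two_n, N2 / two_n      # assumed: omega_i = N_i / (2N)
    sep_hi = (math.sqrt(1 / w1) + math.sqrt(1 / w2)) ** 2 / two_n
    sep = (math.sqrt(x2) + math.sqrt(y2)) ** 2
    return (x_lo * (1 - tol) <= x2 <= x_hi * (1 + tol)
            and y_lo * (1 - tol) <= y2 <= y_hi * (1 + tol)
            and sep <= sep_hi)

N1, N2 = 30, 70
r_min, r_max = N2 / (2 * N1), 2 * N2 / N1   # range allowed by Proposition A.9
sweep = [r_min + k * (r_max - r_min) / 10 for k in range(11)]
print(all(within_bounds(N1, N2, r) for r in sweep))
```

A ratio outside the allowed range, such as r = 10 for these values, violates the upper bound on $x^2$, which shows that the ratio constraint is doing real work.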



Appendix B: Proof of Lemma 2.16 



Recall that the largest left singular vectors $u_1, u_2$ have the form $[x, \dots, x, y, \dots, y]$,
where $x$ repeats $N_1$ times and $y$ repeats $N_2$ times.

Proof of Lemma 2.16. Let us define the following random variables.



1 



— T 

No ^ 



(B.l) 



such that by Claim 2.15, 



\5\ 



< 



Cl(T 
and hence



max(|7Vi,5|,|7V2r|) < 



N 



(B.2) 



given that we always assume that N2 > Ni. A natural classifier to separate 
individuals would be: when we use Ui; but we do not have access to x and 
y. Recall that 



^N2 



M 



2N 



2N 



2N 



We are now ready to show that when $N_1, N_2$ are large enough, we see enough
separation between the mixture sample mean and both $x$ and $y$. We first prove
the following claims.



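Claim B.1. $x N_1 \delta + y N_2 \tau = -\frac{\|\epsilon\|_2^2}{2}$.

Proof. This claim is obvious given that $\|u_1 + \epsilon\|_2 = \|u_1\|_2 = 1$ and, with $u_1$ and $\epsilon$
being real vectors, $\|u_1 + \epsilon\|_2^2 = \|u_1\|_2^2 + \|\epsilon\|_2^2 + 2\langle u_1, \epsilon \rangle = \|u_1\|_2^2 + \|\epsilon\|_2^2 + 2(x N_1 \delta + y N_2 \tau)$. □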
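The identity behind Claim B.1 holds for any pair of unit vectors and can be checked directly. The sketch below assumes that $\delta$ and $\tau$ are the averages of the perturbation $\epsilon$ over the first $N_1$ and the last $N_2$ coordinates respectively, which is an inference from the form of the inner product rather than a definition recovered from (B.1); the sizes, the perturbation, and the function name are hypothetical.

```python
import math

def claim_b1_sides(N1, N2, x, perturbation):
    """Return (x*N1*delta + y*N2*tau, -||eps||^2 / 2) for a block vector
    u = [x]*N1 + [y]*N2 of unit norm and a perturbed unit vector u + eps."""
    # Choose y (negative here) so that N1*x^2 + N2*y^2 = 1.
    y = -math.sqrt((1.0 - N1 * x * x) / N2)
    u = [x] * N1 + [y] * N2
    raw = [ui + pi for ui, pi in zip(u, perturbation)]
    norm = math.sqrt(sum(v * v for v in raw))
    u_pert = [v / norm for v in raw]          # perturbed vector, unit norm
    eps = [p - ui for p, ui in zip(u_pert, u)]
    delta = sum(eps[:N1]) / N1                # assumed: block average over P1
    tau = sum(eps[N1:]) / N2                  # assumed: block average over P2
    lhs = x * N1 * delta + y * N2 * tau
    rhs = -sum(e * e for e in eps) / 2
    return lhs, rhs

lhs, rhs = claim_b1_sides(2, 3, 0.5, [0.1, -0.05, 0.02, 0.0, -0.03])
print(abs(lhs - rhs) < 1e-12)
```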

We next use $\frac{\|\epsilon\|_2^2}{2} = |x N_1 \delta + y N_2 \tau|$ to obtain a bound on $\left|\frac{N_1 \delta + N_2 \tau}{2N}\right|$, given that



/2N 



\xNid + yN2T\ < 



2V2N 2NV2N 



(B.3) 



Claim B.2. Let Ni < N2, and uji = o,''^d 1^2 = y^, and given that Ni > 



C27 ' 7 UJl 



we have 
N16 + N2T 



2N 



< 



Ni \y-x\^ 

2N 



(B.4) 



max(x,y) 1 
V2N ^ 2N- 



Proof. We next derive a bound on ^'-^^^^'^ . By Separation Lemma 2.8, we have 
|a; — j/| = 02-^/2^ for a constant C2 = 1/2, and thus we have 
Therefore, 

\xNiS + yN2T\ 



> 



V2N 

|max(.T, y){NiS + N2T) + [x ~ max(a;, y))NiS + [y — max(a;, y))N2T\ 

V2N 

\max{x, y)\ {N^S + N2T) \x - y\ max(|7Vi(5|, \N2t\) 



'2N 



'2N 
Thus we have, given (B.2), (B.3) and (B.5),



N16 + N2T 



2N 



< 



< 



< 



< 



|max(x,t/)| 
V2N 



\{Ni5 + N2t)\ 



\xNiS + yN2T\ |x- y|max(|iVi(5|, lA^srl) 



/2N 



/2N 



2NV2N 2N ^ \l N 
N1C21 ^ Ni\y~x\^ 



2NV2N 



2N 



where 



2NV2N ^ AN^/2N 



^ holds so long as A^i > — ^ — , and (B.5) 



C27 



C2^/lci(J IN2 N1C2J , , , , 8cf u;2 , 

■ ' < ; holds so long as A*! > , so that (B.o) 



2N \ N mVrn 



7 ^1 



Ni > 



2V2cicr//V2 



Both conditions are guaranteed by (2.23) in Theorem 2.14. 
This allows us to conclude that 



(B.7) 
□ 



Nix + NiS + N2y + N2T Nix + N2y 



2N 



2N 



N1S + N2T 



2N 



mm{Ni,N2}^ \ 
< I — l|a;-y| 



2N 



Given that — a;| = C2 y/j/\/2N as shown in the Separation Lemma 2.8, we 
have 



Nix + NiS + N2y + N2T 



2N 



N2{y-x) , Nid + N2T 



> 



> 



> 



2N 

N2{y~x) 



2N 



2N 
Ni5 + N2T 



2N 



N2\y-x\ _ TDin{Ni,N2}^\y-x\ 

2N 2N 
{l^^)N2\y-x\ 
2N 






and similarly, 

Nix + NiS + N2y + N2T 



y 



2N 



Ni{y-x) N16 + N2T 



> 



> 



> 



2N 

Niiy-x) 



2N 



2N 

NiS + N2T 



2N 



Ni\y~x\ mm{Ni,N2}^\y~x\ 

2N 2N 
(1 - ^)N, \y - x\ 



2N 



□ 